PREFACE

This text was prepared for the affective component of a graduate level 
course in affective and cognitive instrument development. The techniques 
described, and the data sets included, represent attempts over several years 
to prepare materials that would illustrate proper instrument development 
techniques in the affective domain. The need for this text became apparent 
after witnessing several large-scale research projects that were hindered by 
inadequately prepared instruments. Researchers in these projects were 
often not aware of the need to use appropriate procedures in developing 
affective instruments; furthermore, they could rarely, if ever, locate a
comprehensive and readable text that could help. This text was developed to
meet this important instructional need.

Chapter 1 discusses the importance of affective variables and presents
conceptual definitions of major affective constructs such as attitudes,
self-concept, interests, and values. Chapter 2 outlines and illustrates the
domain-referenced approach for developing operational definitions (i.e., items)
for the targeted conceptual definitions. Chapter 3 addresses the important
area of scaling the affective characteristics in the context of Fishbein’s 
expectancy-value model. The Thurstone, latent-trait, Likert, and semantic 
differential techniques are included along with a section on normative versus 
ipsative measures. 

Chapters 4 and 5 present the underlying theory and the empirical
techniques appropriate for examining validity and reliability evidence. Data
gathered by the author, using several different instruments, are included to
illustrate each technique. Annotated SPSSX computer output for the Factor
and Reliability programs is included for examining construct validity using
factor analytic techniques and alpha internal consistency reliability. Actual
item analysis and reliability data from the SPSSX Reliability program are
also presented. Decision strategies are discussed and models for reporting
the data analysis are illustrated. Readers should find these sections quite 
useful from an instructional viewpoint. For those interested in further study, 
several journal articles illustrating the techniques described are referenced 
at the end of each chapter. Finally, Chapter 6 reviews the steps in the process 
of instrument development. 

In the preparation of any text, several people play an important role and 
deserve to be acknowledged. First, I would like to acknowledge my mentor 
and friend during my graduate years at the State University of New York at 
Albany, Robert Pruzek. He not only taught me the necessary content but 
also how to conceptualize, conduct, and report my research. During these 
same years the support and probing mind of Robert McMorris will never be 
forgotten. I would like to thank the University of Connecticut for granting 
me a sabbatic leave to work on the text, as well as the many graduate 
students whose penetrating questions have resulted in much rethinking, 
reworking, and, I hope, clearer explanations.

Particular thanks are due to Barbara Helms for her dedicated assistance in
data analysis, preparation of tables, and editing of the manuscript. Several 
graduate students at the Bureau of Educational Research and Service also 
helped. Steve Melnick produced the references and developed the index, 
and Chris Murphy and Bob Garber proofread the final text. I should also 
note that my fall 1984 graduate class in instrument development read and 
critiqued a draft of the text. The comments of Marcy Delcourt and Gina 
Schack were most helpful. Finally, special thanks is extended to Marion 
Lapierre who typed several early drafts of the text and Gail Millerd for her 
work on parts of the final document. 

1 AFFECTIVE CHARACTERISTICS:
THEIR CONCEPTUAL DEFINITIONS


Affective Characteristics and School Learning 


During the 1960s the cognitive domain continued to receive much attention 
as proponents and critics argued the merits and evils of behavioral objec- 
tives, often utilizing the framework for the cognitive domain suggested in 
the Taxonomy of Educational Objectives, Handbook I (Bloom, 1956). In
1964 the Taxonomy of Educational Objectives, Handbook II: The Affective
Domain (Krathwohl et al., 1964) was published but received little attention
in light of the debate over behavioral objectives and the apparent national
consensus that the primary aims of schooling were in the cognitive domain. 

In his article “Assessing Educational Achievement in the Affective
Domain,” Tyler (1973) discussed the growing awareness in the late 1960s and
early 1970s of the need for schools to attend to the affective domain when
developing their learning goals and objectives.

Tyler suggested two prevalent views as explanations of why affective
learning was not systematically planned as part of most school curricula. 
First, many educators felt that affective concerns such as “feelings” were not 
the business of the school but rather the task of the home or church. The 
second view was that affective concerns were natural outgrowths (ends) of 


1 


2 INSTRUMENT DEVELOPMENT IN THE AFFECTIVE DOMAIN 


learning cognitive content and need not be included as separate objectives 
(means) to be addressed during the learning process. Fortunately, during 
the 1970s, affective objectives were recognized to be important as both ends 
and means in the overall school process, and were no longer considered as 
merely acceptable outgrowths of an emphasis on the cognitive domain. As a
result, state-level as well as school- and program-level statements of goals 
and objectives included both cognitive and affective objectives. 

The most recent emphasis on the cognitive domain surfaced mainly as a 
result of the continual decline in standardized test scores in the late 1970s 
and early 1980s. Calls for increased emphasis in the cognitive area rang out 
loudly with publication of the report by the National Commission on
Excellence in Education entitled “A Nation at Risk: The Imperative for
Educational Reform” (Bell, 1983) and the report of the Carnegie Foundation
for the Advancement of Teaching entitled “High School: A Report on
Secondary Education in America” (Boyer, 1983). While the cognitive
domain receives increased attention, the affective area will remain firmly en- 
trenched as an important aspect of the schooling process as well as an 
outcome of schooling. 

Bloom’s (1976) model of school learning depicted in figure 1-1 clearly 
suggests that during instruction students approach any learning task with 
prior affective entry characteristics (e.g., attitudes, self-esteem, interests, 
and values), as well as cognitive behaviors. It is the dynamic interaction 
between these overlapping cognitive and affective domains during the in- 
structional process that results in both cognitive learning outcomes and 
associated affective outcomes. These affective outcomes help guide future 
feelings about course content and issues (attitudes), feelings of personal
worth and success (self-esteem), desires to become involved in various
activities (interests), and personal standards (values).

Figure 1-1. Major variables in the theory of school learning (from Bloom,
1976). [Diagram: student characteristics (cognitive entry behaviors and
affective entry characteristics) enter instruction (learning task(s) and
quality of instruction), which yields learning outcomes (level and type of
achievement, rate of learning, and affective outcomes).]

Figure 1-2. Direction and intensity of student attitudes toward microcomputers
(Target) (Adapted from Anderson, 1981, p. 4). [Diagram: students A through E
placed along a continuum running from negative through neutral to positive.]

Researchers and program evaluators are often faced with the task of 
assessing student affective characteristics at the beginning and end of an 
educational program. Counselors frequently seek to assess student affective 
characteristics to assist in the process of dealing with personal growth and 
vocational decisions. In these situations it is necessary to employ affective
measuring instruments that are both theoretically based and psychometrically
sound. The purpose of this book is to assist in the selection and
development of such instruments.


What are Affective Characteristics? 


Anderson's book entitled Assessing Affective Characteristics in the Schools 
(1981) presents an in-depth theoretical and practical discussion of affective 
instrument construction. The author is indebted to Anderson for providing a 
clear perspective on the conceptual and operational definition of affective 
characteristics.

As described by Anderson, human characteristics reflect typical ways of 
thinking, acting, and feeling in diverse situations (Anderson, 1981, p. 3). 
While the first two areas reflect cognitive and behavioral characteristics, the 
third area reflects affective characteristics, which Anderson describes as 
"qualities which present people's typical ways of feeling or expressing their 
emotions." 

Anderson states that all affective characteristics must have three
attributes: intensity, direction, and target. The intensity attribute refers to
the degree or strength of the feeling. For example, one student's attitude
toward working with microcomputers could be very strong, whereas another
student's could be quite mild. The direction attribute reflects the positive,
neutral, or negative aspect of the feeling. The final attribute, the target,
identifies the object, behavior, or idea at which the feeling is being directed.

Figure 1-2 illustrates intensity, direction, and target attributes. Using
again our microcomputer example, we see that the target of the affect is
“working with a microcomputer.” A hypothetical rating scale has been used
to measure and locate the attitudes of five students toward microcomputers
on this continuum, which specifies negative and positive feelings (direction)
as well as neutral (student C) and quite intense feelings (students A and E).
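
The three attributes can be made concrete with a short sketch. The Python below is illustrative only: the -3 to +3 scale, the scores assigned to students A through E, and the names Rating and describe are assumptions made for this example, not values read from Anderson's figure. Direction is taken from the sign of the score and intensity from its distance from the neutral point.

```python
from dataclasses import dataclass

NEUTRAL = 0  # assumed midpoint of a hypothetical -3..+3 rating scale


@dataclass
class Rating:
    student: str
    target: str  # the object, behavior, or idea the feeling is directed at
    score: int   # hypothetical scale position, negative to positive


def describe(rating: Rating) -> tuple[str, int]:
    """Return (direction, intensity) for one rating."""
    if rating.score > NEUTRAL:
        direction = "positive"
    elif rating.score < NEUTRAL:
        direction = "negative"
    else:
        direction = "neutral"
    # Intensity is the distance from the neutral point, regardless of sign.
    intensity = abs(rating.score - NEUTRAL)
    return direction, intensity


TARGET = "working with a microcomputer"
# Made-up scores chosen so that A and E are intense and C is neutral.
ratings = [Rating(name, TARGET, score) for name, score in
           [("A", -3), ("B", -1), ("C", 0), ("D", 1), ("E", 3)]]

for r in ratings:
    direction, intensity = describe(r)
    print(f"Student {r.student}: {direction}, intensity {intensity}")
```

Keeping direction and intensity as separate values mirrors the distinction in the text: students A and E differ in direction but share the same intensity.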


Types of Affective Characteristics 


Social psychologists have identified numerous constructs that reflect affec- 
tive characteristics. This volume will delimit the potential list to those vari- 
ables most relevant to the school experience. Included will be the following: 
attitudes, self-esteem, interests, and values. Prior to providing the mechan- 
ics of operationally defining these variables in the next chapter, we begin 
with their concise conceptual definitions and relevance to school programs. 


Attitudes 


Kiesler, Collins and Miller (1969) state that “the concept of attitude has
played a central role in the development of American social psychology”
(p. 1). The techniques of attitude measurement and scaling, as well as the
theoretical and empirical issues of attitude change, have received much
attention dating back prior to World War II. Allport (1935) termed attitude
“the most distinctive and indispensable concept in contemporary social
psychology” (p. 798) and offered the following definition:

An attitude is a mental and neural state of readiness, organized through
experience, exerting a directive or dynamic influence upon the individual's
response to all objects and situations with which it is related. (Allport,
1935, p. 810)


However, no single definition of attitude has emerged over the years.
In a review by Severy (1974), two schools of thought regarding the
structural nature of attitudes were described. The first school can be
represented by Thurstone's definition of attitude as


Proponents of this school of thought were known as “unidimensionalists” to
denote their conception of attitudes from only an evaluative (e.g.,
positive-negative or favorable-unfavorable) perspective.

The second school of thought was supported by the component theorists
who conceived of attitudes on more than just an evaluative dimension.
Wagner’s definition illustrates this view: 


An attitude is composed of affective, cognitive, and behavioral components that 
correspond, respectively, to one’s evaluations of, knowledge of, and predisposi- 
tion to act toward the object of the attitude. (Wagner, 1969, p. 7) 


In a later volume, Triandis presented the following definition: 


An attitude is an idea charged with emotion which predisposes a class of actions to
a particular class of social institutions. (Triandis, 1971, p. 2)

This comprehensive definition also illustrates the component theorists' view
that attitudes are composed of three components: cognitive, affective, and
behavioral. The cognitive component is a belief or idea, which reflects a
category of people or objects such as microcomputers. The affective
component represents the person's evaluation of the object or person, and is
the emotion that charges the idea (for example, feeling positive about
working with the computer). Finally, the behavioral component represents
overt action directed toward the object or person. It represents a
predisposition to action such as enrolling in an optional microcomputer
course at school.

Several writers have combined common elements from the various
definitions. In his article entitled “Attitude Measurement and Research,” Aiken
combines several definitions to state that

attitudes may be conceptualized as learned predispositions to respond positively
or negatively to certain objects, situations, concepts, or persons. As such, they
possess cognitive (beliefs or knowledge), affective (emotional, motivational), and
performance (behavior or action tendencies) components. (Aiken, 1980, p. 2)

Campbell (1950) and Green (1954) also attempted to find some communality
among the various attitude definitions. Campbell proposed that agreement
was really present regarding the implicit operational definition of
attitude, and suggested that social attitudes are reflected by a “consistency in
response to an object” (p. 32). Agreeing with this view, Green analyzed
Guttman’s (1944) work in attitudes and stated that the “concept of attitudes
implies a consistency of responses” (Green, 1954, p. 336). That is, the
measurement of attitudes would consist of obtaining responses to a sample
of statements about opinions from a sample of people. Whether these
responses to the affective statements relate to cognitive or behavioral aspects
of attitude can be determined through empirical studies. For example,
Wicker (1969) reviewed 30 studies relating attitude to behavior and found
the two to be not directly related. In reanalyzing the same data, Shaw (cited
in Severy, 1974) focused on seven of the 30 studies which met his standards
of appropriate measurement techniques. In these seven studies the
relationship tended to be higher and Shaw concluded that attitudes can lead to
specific behavior given a particular situation and constraints. Thus, the
cognitive and behavioral components of an attitude, as defined by Triandis
(1971), are important considerations, but behavior should be considered as
a function of one’s attitude in the context of the particular situation.

In this volume the focus for attitudes will be clearly placed on the affective
component. Consistency of responses to statements about ideas, people,
and objects will be employed to reflect social attitudes. Emphasis will be
placed upon the popular definition provided by Fishbein and Ajzen, which
states that attitudes reflect


a learned predisposition to respond in a consistently favorable or unfavorable
manner with respect to a given object. (Fishbein and Ajzen, 1975, p. 6) 


Readers are encouraged to read Fishbein and Ajzen’s (1975) book entitled 
Belief, Attitude, Intention, and Behavior: An Introduction to Theory and 
Research, which describes the Expectancy-Value Model. In this model atti- 
tudes are distinguished from beliefs in that attitudes represent the indi- 
vidual's favorable or unfavorable evaluation (i.e., good-bad) of the target 
object while beliefs represent the information the individual has about the 
object. Attitudes toward objects are determined by joining the product of 
the evaluation of a particular attribute associated with the target object and 
the subjective probability (i.e., belief) that the object has the attribute. 
Accordingly, the evaluation of the attribute contributes to the individual's 
attitude in proportion to the strength of his beliefs (see Fishbein and Ajzen, 
1975, pp. 222-223). In chapter 3 Fishbein’s work will be further described as it
forms the basis for the scaling of all of the affective characteristics described 
in this volume (e.g., attitudes, self-concept, interest, and values). Fishbein 
and Ajzen's work clearly parallels Anderson's summary which states that 


attitudes are feelings that generally have a moderate level of intensity, can be
either unfavorable or favorable in direction, and are typically directed toward
some object (that is, target). The association between feelings and a particular
target is learned. And, once learned, the feelings are consistently experienced in
the presence of the target. (Anderson, 1981, p. 33)
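
The expectancy-value combination described earlier in this section can be sketched numerically. The model is commonly written as a sum, over the attributes a person associates with the object, of belief strength multiplied by the evaluation of that attribute. In the Python sketch below, the attribute names, the 0-to-1 belief strengths, and the -3 to +3 evaluations are invented for illustration, and attitude_score is a hypothetical helper rather than a published scoring procedure.

```python
def attitude_score(beliefs, evaluations):
    """Sum belief strength times evaluation over all attributes.

    beliefs:     attribute -> subjective probability that the object has it
    evaluations: attribute -> favorable/unfavorable rating of the attribute
    """
    if beliefs.keys() != evaluations.keys():
        raise ValueError("beliefs and evaluations must cover the same attributes")
    return sum(beliefs[attr] * evaluations[attr] for attr in beliefs)


# Hypothetical student rating "working with a microcomputer".
beliefs = {"saves time": 0.5, "is hard to learn": 0.25, "is fun": 1.0}
evaluations = {"saves time": 2, "is hard to learn": -2, "is fun": 3}

score = attitude_score(beliefs, evaluations)
print(f"Attitude score: {score}")
```

Because each evaluation is weighted by its belief strength, a strongly negative attribute that the student barely believes applies (here, "is hard to learn") pulls the overall score down only slightly.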

As discussed by Tyler, attitudes appropriately appear in most statements
of educational objectives. In content areas such as social studies the
objectives usually pertain to “the development of objective attitudes toward
alternate possible explanations of social phenomena and toward policies for
dealing with social problems” (Tyler, 1973, p. 5). While such an objective
has a cognitive component dealing with the recognition of objectivity in
dealing with social problems, the affective (emotional) component attempts
to free the student from safe, familiar views and promotes exploration of
new views.

Several educational programs also include statements of attitude
enhancement in their objectives. Typical statements begin with the phrase
“students should develop positive attitudes toward” and end with a
particular target (e.g., reading, math, learning, school, teachers, or special
education students).

Popular measures of attitudes in educational settings are the Attitude 
Toward School K-12 (Instructional Objectives Exchange, 1972a), the Sur- 
vey of School Attitudes (Hogan, 1975), and the School Attitude Measure 
(Dolan and Enos, 1980). 


Self-Concept 


An early comprehensive definition of self-concept was presented by
Coopersmith as follows:

... the evaluation which the individual makes and customarily maintains with
regard to himself; it expresses an attitude of approval or disapproval, and
indicates the extent to which the individual believes himself to be capable,
significant, successful, and worthy. In short, self-esteem is a personal
judgement of worthiness that is expressed in the attitudes the individual holds
toward himself. It is a subjective experience which the individual conveys to
others by verbal reports and other overt expressive behavior. (Coopersmith,
1967, pp. 4-5)


After integrating features of several definitions such as Coopersmith’s,
Shavelson et al. (1976) state that in broad terms self-concept “is a person’s
perception of himself” (p. 411). The perceptions are formed through
experiences with the environment with important contributions coming from
environmental reinforcements and significant people in one’s life (i.e.,
self-concept is learned). Shavelson et al. further identify seven features
critical to defining the complex self-concept construct as follows: organized,
multifaceted, hierarchical, stable, developmental, evaluative, and
differential. Interested readers are encouraged to read this important
discussion. In a later article, Shavelson, Bolus, and Keesling (1980) present
the results of administering six self-concept instruments to seventh and
eighth grade students. Covariance structure analysis supported the contention
that self-concept is causally predominant over achievement.

Similar to the other affective characteristics identified in this volume, the
target, direction, and intensity of self-concept can be identified. The target
of self-concept is usually the person but could also be areas such as the 
school (i.e., academic self-concept); the direction can be positive or nega- 
tive; and the intensity can range on a continuum from low to high. 

The self-concept construct has received considerable attention over the 
last 15 years due to the reemphasis on affective outcomes of education and 
the reported relationships between affective and cognitive measures. Pur- 
key’s (1970) book entitled Self-Concept and School Achievement was an 
important volume as it supported the relationship between self-concept and 
achievement and inspired much additional interest and research in examin- 
ing the relationships between the affective and cognitive domains. Readers 
are referred to the Shavelson et al. (1976) article, which reviews several 
self-concept studies and provides a comprehensive discussion of the valid- 
ity of self-esteem construct interpretations for several popular self-esteem 
instruments (e.g., Michigan State Self-Concept of Ability scale (Brookover 
et al., 1965); Self-Esteem Inventory (Coopersmith, 1967); Self-Concept In- 
ventory (Sears, 1963); Piers-Harris Children’s Self-Concept Scale (Piers and 
Harris, 1964)). Other popular measures not reviewed by Shavelson et al.
include the Tennessee Self-Concept Scale (Fitts, 1965) and the primary, 
intermediate, and secondary forms from the Instructional Objectives 
Exchange volume entitled Measures of Self-Concept K-12 (1972b). 

Several school programs include objectives pertaining to enhancing
self-concept. Typical statements read as follows: Students will develop positive
feelings of self-worth. Students will evidence positive perceptions of self in
relation to peers (or school achievement, i.e., academic self, or family
social relations).


Interest 


Interest measurement grew out of the early graduate school work by Cow- 
dery who reported the differential interests of lawyers, physicians, and 
engineers (cited in DuBois, 1970). On the basis of this work in the early 
1900s interest measurement became a focal point of vocational guidance 
through the extensive contributions of such researchers as E. K. Strong and
F. Kuder. Defining interests as “preferences for particular work activities” 
(Nunnally, 1978), most inventories developed during the 1900s have item 
content that reflects occupational and work activities and employ the “Like- 
Dislike” rating instructions. 

Examples of popular interest inventories are the Strong-Campbell
Interest Inventory, which was formerly the Strong Vocational Interest Blank
(Campbell, 1977); the Kuder Occupational Interest Survey, which was
originally presented as the Kuder Preference Record (Kuder and Diamond);
the Minnesota Vocational Interest Inventory (Clark and Campbell);
the Vocational Preference Record (Holland, 1975); the Jackson Vocational
Interest Survey (Jackson, 1977); and the Ohio Vocational Interest Survey
(D'Costa, Odger, and Koons, 1969). Interested readers are referred to
Zytowski's (1973) book Interest Measurement and Dawis' (1980) excellent
article entitled “Measuring Interests” for discussions of the history and
techniques of interest measurement.

Similar to other affective characteristics examined in this volume,
interests can be described with regard to their target, direction, and intensity.
The targets of interests are activities; the direction can be described as
interested or disinterested; and the intensity can be labeled as high or low.
Interests with high intensity would tend to lead one to seek out the activity
under consideration.

According to Tyler, school objectives in the area of interests are quite
justified when the school activity involved “can contribute to the individual's
development, social competence, or life satisfaction” (Tyler, 1973, p. 4).
These objectives should be designed to develop interests for future learning
in a wide variety of major fields of knowledge so that the student desires to
pursue several activities that will assist in building a “more comprehensive
and accurate picture of the world” (p. 4). Furthermore, Tyler suggests that
appropriate school affective objectives in the interest area should broaden
student interest to learn important things from several fields as well as
deepen student interest to attend to a few special content areas. Typical
statements of educational objectives reflecting student interests would read
as follows: Students will develop interest in listening to music. Students will
develop an interest in reading.


Values 


In his book Beliefs, Attitudes, and Values, Rokeach (1968) argues that the
concept of values is the core concept across all the social sciences. In a later
book entitled The Nature of Human Values and Value Systems, Rokeach
defines a value as

an enduring belief that a specific mode of conduct or end-state of existence is
personally or socially preferable to an opposite or converse mode of conduct or
end-state of existence. (Rokeach, 1973, p. 5)


Clarifying the difference between an attitude and a value, Rokeach states 
that an attitude refers to an organization of several beliefs around a specific
object or situation whereas a value refers to a simple belief of a very specific
kind:

This belief transcends attitudes toward objects and toward situations; it is a 
standard that guides and determines action, attitudes toward objects and situa- 
tions, ideology, presentations of self to others, evaluations, judgements, justifi- 
cations, comparisons of self with others, and attempts to influence others. 
(Rokeach, 1973, p. 25) 


Other writers have referred to values as “the importance or worth attached 
to particular activities and objects" (Aiken, 1980, p. 2); “preferences for life 
goals and ways of life" (Nunnally, 1978, p. 589); “a belief upon which a man
acts by preference" (Allport, 1961, p. 454); and as a "conception of the 
desirable—that is, of what ought to be desired, not what is actually 
desired— which influences the selection of behavior" (Getzels, 1966, p. 98). 
The Getzels, Rokeach, and Tyler definitions of a value were summarized by 
Anderson in a most informative manner as follows: 


First, values are beliefs as to what should be desired (Getzels), what is important 
or cherished (Tyler), and what standards of conduct or existence are personally or 
socially acceptable (Rokeach). Second, values influence or guide things: behavior 
(Getzels); interests, attitudes, and satisfactions (Tyler); and a whole host of 
items, including behavior, interests, attitudes, and satisfactions (Rokeach). 
Third, values are enduring (Rokeach). That is, values tend to remain stable over
fairly long periods of time. As such they are likely to be more difficult to alter or
change than either attitudes or interest. (Anderson, 1981, p. 34)


The target, direction, and intensity of values can also be identified. 
According to Anderson, the targets of values tend to be ideas, but as the 
definition offered by Rokeach implies, the targets could also be such things 
as attitudes and behavior. The direction of a value could be positive or 
negative (or right-wrong, important-unimportant). Finally, the intensity of 
values can be referred to as high or low depending on the situation and the 
value referenced. 

In this volume two types of values will be discussed: work values and 
interpersonal values. Work values refer to satisfactions that people desire in 
their future work such as economic returns, altruism, and independence. 
Super's Work Values Inventory (1970) will be discussed as it is a popular 
instrument for assessing work value orientations for high school students. 
Interpersonal values represent values that people consider important in 
their way of life such as support, leadership, conformity, and benevolence. 
Gordon's Survey of Interpersonal Values (1960) will be discussed in a later 
chapter since it is often used in both the educational and business worlds to 
assess interpersonal values.

The important role of values in an educational program is discussed by 
Tyler in his monograph entitled “Assessing Educational Achievement in the 
Affective Domain.” In this work Tyler uses a definition of values that indi- 
cates their role in influencing interests, attitudes, and satisfactions by stating 
that a value is 


an object, activity, or idea that is cherished by an individual which derives its
educational significance from its role in directing his interests, attitudes, and
satisfactions. (Tyler, 1973, p. 7)


In arguing that there are sound, aesthetic, and good-health values that
are appropriate for objectives of schooling, Tyler says that 


Since human beings learn to value certain objects, activities, and ideas so that 
these become important directors of their interests, attitudes, and satisfactions, 
the school should help the student discover and reinforce values that might be 
meaningful and significant to him/her in obtaining personal happiness and making 
constructive contributions to society. (Tyler, 1973, p. 6) 


As examples of such appropriate objectives reflecting underlying values, 
Tyler lists the following Citizenship objectives approved by lay panels for 
the National Assessment of Educational Progress: 


1. show concern for the well-being and dignity of others, 
2. participate in democratic civil improvement, and 
3. help and respect their own families. (Tyler, 1973, p. 6) 


Relationships Among the Affective Characteristics


This chapter has examined the conceptual definitions of affective
characteristics selected on the basis of their relevance to the school
experience: attitudes, self-concept, interests, and values. In general terms,
attitudes were described as feelings toward some object; self-esteem reflected
perceptions of self; interests represented preferences for particular
activities; and values reflected beliefs in particular life goals and ways of
life.

Clarification of the similarities and differences among the constructs
was obtained through Anderson’s (1981) discussion of their target,
direction, and intensity attributes. It should be emphasized that the constructs
selected in this volume are clearly not independent. While many writers may
disagree with respect to the criteria for a taxonomy, it appears that some
general statements can be offered. Values and a related value system can be
considered as central to one’s overall personality. Manifestations of one’s
values may be seen in one’s interests and attitudes. Some would say that
interests and attitudes are quite similar in that attitudes are targeted toward 
objects and that interests really reflect attitudes toward activities. Clearly, 


Additional Readings 


Dawes, R. M. (1972). Fundamentals of attitude measurement. New York: Wiley.
Henerson, M. E., Morris, L. L., and Fitz-Gibbon, C. T. (1978). How to measure
attitudes. Beverly Hills: Sage.
Insko, C. A. (1967). Theories of attitude change. Englewood Cliffs, NJ:
Prentice-Hall.
validation of self-concept interpretations of test scores. In M. D. Lynch et al.
(Eds.), Self-concept: Advances in theory and research. Boston: Ballinger.
Thurstone, L. L. (1928). Attitudes can be measured. American Journal of Sociology,
33, 529-544.
Thurstone, L. L. (1931). The measurement of social attitudes. Journal of Abnormal
and Social Psychology, 26, 249-269.
Thurstone, L. L., and Chave, E. J. (1929). The measurement of attitudes. Chicago:
University of Chicago Press.
Wylie, R. C. (1979). The self-concept: Theory and research on selected topics.
Lincoln: University of Nebraska Press.


2 CONSTRUCTING AFFECTIVE 
INSTRUMENTS 


In chapter 1, conceptual definitions of selected affective characteristics were
presented. This chapter will review a practical framework described by 
Anderson (1981) for operationally defining the affective variables. 

The content and construct validity of the affective measures are 
extremely dependent on the existence of appropriate operational defini- 
tions, which directly follow from the theoretically based conceptual defini- 
tions. In a later chapter procedures for examining the content and construct 
validity of the instrument will be described. In this chapter we will address a 
situation often confronted during program evaluation or research activities: 
From a theoretical perspective, you know what you want to measure (e.g.,
attitude toward school subjects), but you are not sure how to develop your 
own instrument if no applicable instruments are available. 


Operational Definitions 
After the theory concerned with the affective characteristic has been
thoroughly reviewed, the next step is to generate the perceptions, attributes,
or behaviors of a person with high or low levels of this characteristic. Anderson
(1981) illustrates two similar approaches to this task: the domain-referenced
approach and the mapping-sentence approach. The domain-referenced 
approach, modeled after Hively's (1974) work in developing domain- 
referenced achievement tests, is highly recommended for this task. When 
carefully implemented, the procedure leads to clear operational definitions 
of the affective characteristics, which properly follow from the conceptual 
definitions. Resulting instruments can then be used to generate data that 
permit valid inferences regarding an individual's location on the intensity 
continuum underlying the affective characteristic. Several instruments end 
in failure from a psychometric point of view due to a lack of clear correspon- 
dence between the intended conceptual and the actual operational defini- 
tion employed. We will address this validity issue further in a later chapter. 


The Domain-Referenced Approach


In the domain-referenced approach to affective scale construction described
by Anderson (1981), the target and direction of the affective characteristic
are first addressed and then the intensity aspect is considered. It is proposed
that Anderson's technique be adapted to also include a statement of the
a priori judgmentally developed categories the clusters of statements are
intended to represent.¹

Table 2-1 illustrates the domain-referenced approach employed to develop the
Gable-Roberts Attitude Toward School Subjects Scale (Gable and Roberts, 1983).
The activity column specifies the process to be followed in operationalizing
the affective characteristic attitude toward school subjects. The second
column contains the target object domain for the affective characteristic.
Finally, the last column specifies the domain or content categories the
instrument developers intended to build into the instrument on an a priori
basis as a result of the literature review.

In this example of attitude toward school subjects (table 2-1), the instrument
was designed to cover several different subjects, so the general target was
listed as “subject.” Readers should note that the target could be further
broken into such areas as numbers and algebra.


An Illustration 


To illustrate how the domain-referenced approach can be employed, the
example in table 2-1 will be discussed. In step 1 the developers first
identified the attitude target “school subjects.” Based upon the review of
literature, interviews of teachers, and the theoretical base underlying the
program being evaluated or the other variables in the study, the a priori
categories were then selected. In this example, the developers wished to
build three categories of items into the attitude measure: General Interest,
Usefulness, and Relevance. With the target and categories in mind, the
developers then described a class of applicable verbs and directional
adjectives. In step 2, the target object, subject, was selected and applicable
lists of verbs and adjectives were generated keeping in mind the category
(e.g., General Interest) of statements that were selected on an a priori
basis. Step 3 was quite simple, as one example from each domain was selected
(e.g., target: subject; verb: is; adjective: interesting; category: general
interest) so that a draft statement could be listed in step 4 (e.g., The
subject is interesting.).
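
Steps 1 through 4 can be sketched in a few lines of code. The Python fragment
below is only an illustration and is not part of the Gable-Roberts procedure:
the function name draft_statements is hypothetical, and every verb and
adjective other than "is" and "interesting" is an assumption made for the
example.

```python
from itertools import product

# Step 1: target and a priori category (taken from the text).
target = "subject"
category = "General Interest"

# Step 2: verb and adjective domains. Only "is" and "interesting" appear in
# the text; the remaining entries are illustrative assumptions.
verbs = ["is", "seems"]
adjectives = ["interesting", "boring", "enjoyable"]


def draft_statements(target, verbs, adjectives):
    """Steps 3-4: link the target to each verb/adjective pair as a draft item."""
    return [f"The {target} {verb} {adj}." for verb, adj in product(verbs, adjectives)]


statements = draft_statements(target, verbs, adjectives)
for statement in statements:
    print(f"[{category}] {statement}")
```

A list produced this way is only raw material. Each draft statement still has
to be judged against the a priori category before it is kept.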

Step 5 is a crucial step as it involves developing several statements that are
semantic transformations of the first statement. These transformations must
reflect the domain attributes selected for the first statement. The easiest type
of transformation is a rather direct reuse of essentially the same words. For
example, the original statement would read “The subject is interesting” and
the transformations would pick up on the word interest to yield statements
such as “The subject does not interest me” or “I have no interest in the
subject.” In addition to these somewhat direct transformations, it is
recommended that different words from the adjective and verb lists be selected
to yield similar transformations within the same a priori content category.
Examples of such statements from the General Interest category would be
“I really enjoy the subject” and “I find the subject to be a real bore.”

The later importance of developing good transformed statements lies in
the fact that all of the resulting statements should, in this example, reflect the
a priori category of General Interest. It is hoped that content similarities
among these statements will lead later respondents to provide internally
consistent responses to the items that have been clustered on an a priori basis
into the category “General Interest” (see the discussion of consistent
responses underlying the definition of attitudes in chapter 1). For example, a
student really liking the subject should tend to agree with the statements
listed in step 5 of table 2-1: “The subject is interesting” and “I really enjoy
the subject”; they should tend to disagree with the statement “I find the
subject to be a real bore.” To the extent that the respondents consistently
rate the statements in this manner, the categories built into the instrument
on an a priori basis will tend to emerge in the later data analysis to be
factors or constructs measured by the instrument. Inconsistencies in responses
will tend to lower internal consistency reliabilities and result in
meaningless (invalid) scores from the instrument. This discussion of validity
and reliability will be suspended until a later chapter. The point here is that
this early stage in the instrument development process, where the domains are
specified and several statements generated, is a most crucial aspect of the
overall instrument development process. Treat it seriously, take lots of
time and be creative!
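
To make the idea of internally consistent responses concrete, the following
Python sketch scores the three General Interest statements quoted above. The
ratings are invented, the 5-point agreement format is an assumption, and the
helper names are hypothetical. The reliability index is the conventional
coefficient alpha formula, shown here only as a preview of the later chapter.

```python
from statistics import variance

# Hypothetical ratings (1 = strongly disagree ... 5 = strongly agree) from six
# students; the data are invented for illustration only.
responses = {
    "The subject is interesting.":           [5, 4, 4, 2, 1, 3],
    "I really enjoy the subject.":           [5, 5, 4, 2, 1, 3],
    "I find the subject to be a real bore.": [1, 2, 2, 4, 5, 3],
}
negative_stems = {"I find the subject to be a real bore."}


def reverse_score(rating, low=1, high=5):
    """Flip a rating so that agreement with a negative stem scores low."""
    return low + high - rating


def coefficient_alpha(items):
    """Coefficient alpha for a list of per-item rating lists (same respondents)."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]
    item_variance = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))


# Reverse the negatively worded stem before summing across items.
scored = [
    [reverse_score(r) if stem in negative_stems else r for r in ratings]
    for stem, ratings in responses.items()
]
alpha = coefficient_alpha(scored)
print(f"alpha = {alpha:.2f}")
```

Because the invented students answer the three statements in the same
direction once the negative stem is reversed, alpha is close to 1. Inconsistent
ratings would pull it down.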

Table 2-2 illustrates the use of the domain-referenced approach for 
developing the Gable-Roberts Attitude Toward Teacher Scale (Gable and 
Roberts, 1982). Readers may wish to see how the statements and their 
transformations were generated. For further examples, see Anderson 
(1981) where the technique is illustrated for the areas of "attitude toward 
mathematics” and “interest in teaching.” 


Item Content Sources 


But where do all these targets, verbs, adjectives, and categories come from? 
As noted earlier, a well-done literature review will be a rich source of 
content. The theoretical work behind the affective characteristic, as well as
studies using Osgood’s work with the semantic-differential technique, may
be a rich source of adjectives. (See Osgood, Suci, and Tannenbaum, 1957,
p. 43.) Applicable verbs will, of course, depend on the target you have
selected. For example, the attitude toward teacher illustration in table 2-2
leads one to focus on verbs that will describe typical behaviors of teachers.
The literature on teacher evaluation (see for example Brophy and Good,
1974; Good and Brophy, 1978) will assist in generating the necessary verbs.

The most useful technique, though, is the interview/observation process.
After identifying the attitude target and the group to be administered the
instrument (e.g., school subjects; high school students), spend considerable
time talking with students about how they feel about school subjects.
Through tape recordings and notes, you should be able to find several
applicable verbs, adjectives, and even possible categories suggested by the
students. Finally, a group of graduate students or teachers can serve as
excellent resource people for this task. Tell them exactly what you are doing
and brainstorm lists of verbs and adjectives. Also, give these individuals a
sheet of paper with a definition of the a priori category and the first
generated statement at the top and ask them to generate as many alternate ways
to say the same thing as possible. If you are fortunate, you will have access
to a group of graduate students in gifted education. These tasks are viewed by
gifted folks as a real challenge to their creativity and can be the answer to
your need for several parallel statements.


Summary


In this section we have reviewed a procedure for generating operational 
definitions for affective characteristics. Once the target and a priori catego- 
ries are specified, appropriate verb and adjective domains are developed in 
light of the domain categories the developer desires in the instrument. By 
selecting from these domains, developers are then able to link the target and 
adjectives with verbs to generate sentences that become the statements or 
items on the instrument. If this process is successful, the conceptual defini- 
tions underlying the affective characteristic will be operationally defined. 
During the later stage of content validation in the instrument development
process, the relationship between the operational and conceptual definitions
will be supportive of the content validity of the instrument (see chapter 4). If
the operational definitions are poorly constructed, it is doubtful the
instrument could have much content validity. Later empirical examinations of
construct validity and internal consistency reliability would most likely also
be quite depressing to the developer.


Note 


¹ Later, after response data are gathered using the instrument, these categories
could become the constructs measured by the instrument.


3 SCALING AFFECTIVE
CHARACTERISTICS 


In chapter 1 we noted that affective characteristics can be described as
having intensity, direction, and a target. A framework was described for
developing statements for an affective instrument by carefully sampling
from a universe of content. After developing such statements, we typically
obtain the responses of selected individuals to these statements and claim
that we have measured the intensity and direction of their affect toward the
particular target object.


Measurement 


The basis of the above activities is the process we call measurement. As
Wright and Masters (1982) have noted, measurement begins with the concept of
a continuum on which people can be located with respect to some trait or
construct. Instruments in the form of test items are used to generate
numbers called measures for each person. It is important to realize that the
test items are also located on the continuum with respect to their direction
and intensity. Variations of judgmental and empirical techniques are used to
scale the items. Some scaling procedures calculate numbers called
calibrations which indicate the location of an item on the underlying affective
continuum. During the measurement process people are measured by the
items that define the trait underlying the continuum. That is, people are
located on this abstract continuum at a point which specifies a unit of
measurement to be used to make “more or less” comparisons among people
and items. According to Wright and Masters, the requirements for measuring
are:


1. the reduction of experiences to a one dimensional abstraction (i.e.,
continuum),

2. more or less comparisons among persons and items,

3. the idea of linear magnitude inherent in positioning objects (i.e., peo-
ple) along a line, and

4. a unit determined by a process which can be repeated without modifica-
tion over the range of the variable. (Wright and Masters, 1982, p. 3)


The process, which can be repeated without modification, is actually the 
measurement or scaling model used to describe how people and items in- 
teract to produce measures for people. Several such scaling models have 
received much attention over the last 60 years with particular emphasis 
during the last 20 years. In this chapter we will describe some models that 
have been utilized for scaling affective variables. Differences among the 
models with respect to the process used for calibrating the items and locat- 
ing people on the continuum underlying the affective construct will be 
discussed. The techniques to be presented include Thurstone's (1931a) 
Equal-Appearing Interval technique, some recent developments using La- 
tent Trait Theory reported by Wright and Masters (1982), Likert's (1932) 
Summated Rating technique, and Osgood's Semantic Differential technique 
(Osgood, Suci and Tannenbaum, 1957). Finally, we will illustrate a proce- 
dure suggested by Fishbein and Ajzen (1975) for combining belief and prob- 
ability components to measure attitudes. Following the description of the 
attitude scale techniques, the chapter will conclude with a discussion of the 
properties of ipsative and normative scales. 

All of the scaling techniques will be presented in the context of Fishbein's
Expectancy-Value Model for measuring attitudes (Fishbein and Ajzen,
1975). While Fishbein’s model addresses the affective characteristic
attitudes, it can be generalized to the other characteristics described in
chapters 1 and 2: self-concept, values, and interests.


Fishbein’s Expectancy-Value Model


As we noted in chapter 1, Fishbein has argued that attitude should be
measured by a procedure that locates the individual on a bipolar evaluative
dimension with respect to some target object (Fishbein and Ajzen, 1975).
This attitude scaling process takes place in the framework of an
expectancy-value model. To scale people using the expectancy-value model,
Fishbein distinguishes between attitudes and beliefs. Whereas attitudes are
described as the individual’s favorable or unfavorable evaluation (i.e.,
good-bad) of the target object, beliefs represent the information the
individual has about the object. In this context, then, the belief links the
target object to some attribute. For example, table 2-2 contained operational
definitions for “attitude toward teacher.” In step 4 of the table we developed
the statement “This teacher makes learning fun.” This statement is, in fact, a
belief that links the target object “teacher” to the attribute “makes learning
fun.” A person's attitude, then, is a function of his beliefs at a given time
and is based upon the person's total set of beliefs about the target object
and the evaluations of attributes associated with the object (see Fishbein and
Ajzen, 1975, chs. 3 and 6).

Fishbein's model for scaling people's attitudes is based upon the
relationships between beliefs about a target object and attitude toward that
object. In this model, different beliefs and evaluations of the target object
are combined in a summative manner to produce an index representing the
overall attitude toward the target object. The integration process described
by Fishbein (Fishbein and Ajzen, 1975, p. 29) is presented here as equation
3.1,

$$A_o = \sum_{i=1}^{n} b_i e_i \qquad (3.1)$$

where $A_o$ is the overall attitude toward some object, O; $b_i$ is the belief
i about O, i.e., the subjective probability that O is related to attribute i;
$e_i$ is the favorable evaluation of attribute i; and n is the number of
beliefs. As stated in chapter 1, an individual's attitude toward some object is
determined by forming the product of that individual's favorable-unfavorable
evaluation of each attribute associated with the target object and his/her
subjective probability that the object has the attribute, then summing these
products across the total set of beliefs. Thus, the evaluation of the attribute
contributes to the person's attitude in proportion to the strength of his/her
beliefs (Fishbein and Ajzen, 1975, pp. 222-223).

In this chapter we will describe some attitude scaling techniques all of
which arrive at a single attitude score based upon responses to statements of
beliefs. All the techniques yield attitude scores that represent the person's 
location on a bipolar evaluative dimension with respect to the target object. 
For all procedures attitude scores are obtained from the product of beliefs, 
b, about the object and evaluations, e, of the attributes of the object. One 
difference among the procedures represents the relative weights placed on 
the beliefs (b) and the evaluations (e) of the respective attributes in develop- 
ing the attitude score. Another difference lies in the properties of the indi- 
vidual items. Depending on the technique employed, items selected for 
inclusion in the instrument are based upon different criteria, which results in 
selecting items with different item characteristic curves or tracelines. (These 
terms will be made clear in a later section.) 
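
The summation in equation 3.1 can be sketched in Python as follows. The
function name, the three beliefs, and the -3 to +3 evaluation range are
assumptions chosen for illustration. Fishbein and Ajzen's own scoring
conventions are not reproduced here.

```python
def attitude_score(beliefs, evaluations):
    """Equation 3.1: sum of b_i * e_i across the n beliefs about the object."""
    if len(beliefs) != len(evaluations):
        raise ValueError("each belief needs a matching evaluation")
    return sum(b * e for b, e in zip(beliefs, evaluations))


# Hypothetical respondent with three beliefs about "this teacher".
b = [0.9, 0.5, 0.2]   # subjective probability that the teacher has attribute i
e = [3, 1, -2]        # evaluation of attribute i on an assumed -3..+3 scale

score = attitude_score(b, e)
print(f"A_o = {score:.1f}")
```

The strongly held belief about a highly valued attribute (0.9 x 3) dominates
the total, which is the sense in which each evaluation contributes in
proportion to the strength of the belief.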

In the sections that follow, we will clarify and illustrate the similarities 
and differences among the attitude scaling techniques. Each technique will 
be described in the context of Fishbein’s expectancy-value model. 


Thurstone Equal-Appearing Interval Scale 


The Thurstone technique was originally developed by Thurstone and Chave
(1929) and has been described by Thurstone (1931a) and Edwards (1957)
(see also Anderson, 1981; Fishbein and Ajzen, 1975; Nunnally, 1978;
Thurstone, 1927, 1928, 1931b, 1946). Employing the expectancy-value model,
the Thurstone technique begins with a set of belief statements (i.e.,
attributes) regarding a target object. These statements are then located (i.e.,
calibrated) on the favorable-unfavorable evaluative dimension through a
judgmental procedure that results in a scale value for each belief statement.
In the context of Fishbein’s expectancy-value model specified in equation
3.1, the favorable-unfavorable evaluations (e) of the belief statements or
attributes are obtained from an independent set of judges whose ratings are
used to place the statements on the evaluative continuum. The values of e
can then range from the highest to lowest value used in the judges' rating
procedure (e.g., 1-5, 1-11, etc.). On the other hand, when later respondents
select individual attributes as being characteristic of the target object,
they are actually specifying the values of b or the probability that the target
object possesses the stated attribute. The values of b in the Thurstone
technique are 0 if the statement is not selected and 1 if the respondent
selects the statement as a characteristic of the target object.

Prior to describing the steps in the Thurstone procedure we need to
mention Thurstone's early use of paired comparisons. After the set of items
had been scaled by the judges, items were paired with other items with
similar scale values, and sets of paired comparisons were developed. In
some cases each item was paired with all other items from other scales on
the instrument and respondents were asked to select the item from the pair
that best described the target object. The Edwards Personal Preference
Schedule (Edwards, 1959) illustrates the use of such paired comparisons.
The problem with this procedure is that a large number of paired items is
needed to measure the affective characteristic: for k items, k(k − 1)/2 pairs
would be necessary. As a result, Thurstone (1931a) developed the successive
interval and equal-appearing interval techniques for measuring affective
characteristics. The equal-appearing interval procedure has proven to
be the most popular and will be described in the next section. Readers
wishing to study the paired comparisons and successive interval techniques
are referred to Edwards' (1957) book Techniques of Attitude Scale
Construction.
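
A short Python sketch shows how quickly the k(k − 1)/2 requirement grows. The
item labels and pool sizes are hypothetical and serve only to illustrate the
count.

```python
from itertools import combinations


def number_of_pairs(k):
    """Number of distinct pairs that can be formed from k items."""
    return k * (k - 1) // 2


# Enumerate the pairs for a hypothetical pool of 20 items.
items = [f"item_{i}" for i in range(1, 21)]
pairs = list(combinations(items, 2))

for k in (10, 20, 50):
    print(f"{k} items require {number_of_pairs(k)} paired comparisons")
```

Doubling the pool roughly quadruples the number of pairs a respondent must
judge, which is the practical burden the interval techniques avoid.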

With this overview in mind, we now proceed to detail the two phases
involved in developing an attitude instrument using Thurstone’s
Equal-Appearing Interval Technique.


Phase I: Item Selection


Using the procedures described in chapter 2, a large set of items (e.g.,
30-50) is constructed to operationally define the affective characteristic. A
group of judges very similar to the future respondents to the instrument is
then asked to rate the items with respect to the extent that the items describe
the affective characteristic. It is essential that the judges realize that they are
not agreeing or disagreeing with the items. Rather they are assisting in the
quantification of the intensity (i.e., favorable-unfavorable) of the
statement. That is, the items are being calibrated or scaled in that the
ratings of the judges will result in locating the item on the psychological
continuum underlying the affective characteristic.

For example, consider the scale developed by Kahn (1974) for evaluating
university faculty teaching. Thirty-five items were developed which
described the process of teaching (see table 3-1). A sample of approximately
300 college students responded to the form in table 3-1 by indicating how
characteristic the statement was regarding the quality of teaching. Note that
the respondents were not asked to rate their particular teacher. Instead,
they were assisting in scaling the item pool with respect to the degree of
teaching quality exhibited in each item. While this form employed a 4-point
scale, the original work of Thurstone (1931a) employed an 11-point scale.
The only guideline for the number of scale points is that the response format
must result in adequate variability. It appears that fewer than 11 points would
achieve this goal.

After the judges’ data are obtained, the distribution of the judges’
responses to each item is generated so that the mean or median response can
be obtained. It is this value that represents the scale value (i.e., weight) for
each item on the psychological continuum underlying quality teaching. In
addition to the scale values, the interquartile range is calculated for each
item, to represent a measure of variability in the judges’ opinions. This
statistic, called the criterion of ambiguity, is used to screen out items on
which the judges disagree with respect to the degree of affect contained in
the item. The criterion of ambiguity represents the first criterion by which
items are selected (i.e., items that are equally spaced and nonambiguous).
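
The calibration step can be sketched with Python's standard statistics module.
The ratings, the 11-point format, and the interquartile-range cutoff of 2.0 are
assumptions made for the example. The text does not prescribe a particular
cutoff.

```python
from statistics import median, quantiles

# Hypothetical ratings by eight judges on an 11-point favorability scale.
judge_ratings = {
    "item_A": [9, 10, 9, 10, 9, 10, 9, 10],   # judges largely agree
    "item_B": [2, 10, 5, 8, 1, 11, 4, 7],     # judges disagree
}


def scale_value(ratings):
    """Scale value (weight) of an item: the median of the judges' ratings."""
    return median(ratings)


def interquartile_range(ratings):
    """Criterion of ambiguity: spread of the middle half of the ratings."""
    q1, _, q3 = quantiles(ratings, n=4, method="inclusive")
    return q3 - q1


MAX_IQR = 2.0  # assumed cutoff; choose one suited to the rating format

retained = [
    item for item, ratings in judge_ratings.items()
    if interquartile_range(ratings) <= MAX_IQR
]
for item, ratings in judge_ratings.items():
    print(item, scale_value(ratings), interquartile_range(ratings))
```

Here the second item has a middling scale value only because the judges are
split, so its large interquartile range flags it as ambiguous and it is
screened out.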

It is also important that the calibration of the items results in scale values 
that have generality beyond the particular sample of judges used to locate 
the items on the continuum. According to Thurstone, 


If the scale is to be regarded as valid, the scale values of the statements should 
not be affected by the opinions of the people who help construct it. This may 
turn out to be a severe test in practice, but the scaling method must stand such a 
test before it can be accepted as being more than a description of the people who 
construct the scale. At any rate, to the extent that the present method of scale 
construction is affected by the opinions of the readers who help sort out the 
original statements into a scale, to that extent the validity or universality of the 
scale may be challenged. (Thurstone, 1928, pp. 547-548) 


For this reason Kahn (1974) sampled several groups of college students 
and compared the scale values for different subgroups of judges on the 
basis of sex and level of program (undergraduate, masters, doctorate). The 
process of scaling the items continued until stable scale values were found. 

Once the scale values are found to be stable across groups of judges, the
actual item selection takes place. If there are 50 items and you desire a
25-item instrument that adequately spans the psychological continuum,
simply select every other item and the result is actually two parallel forms
of the measure. Careful selection of the scale values results in what
Thurstone called an equal-appearing-interval scale (Edwards, 1957). In the
Kahn (1974) example, 20 items were selected from the items in table 3-1.
The items selected and their scale value weights are presented in table 3-2.
Note that the weights have been included in the table but would not appear
in the actual form.
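
Selecting every other item from the ordered pool can be sketched as follows.
The ten item labels and scale values are hypothetical. A real pool would come
from the judges' calibration.

```python
# Hypothetical scale values for a pool of ten calibrated items.
scale_values = {
    "item_01": 1.4, "item_02": 9.8, "item_03": 3.1, "item_04": 6.2,
    "item_05": 2.3, "item_06": 7.9, "item_07": 4.0, "item_08": 8.8,
    "item_09": 5.1, "item_10": 10.6,
}

# Order the items along the continuum, then deal them alternately into forms.
ordered = sorted(scale_values, key=scale_values.get)
form_a = ordered[0::2]
form_b = ordered[1::2]

print("Form A:", [(item, scale_values[item]) for item in form_a])
print("Form B:", [(item, scale_values[item]) for item in form_b])
```

Each form spans the continuum from unfavorable to favorable items, which is
why the two halves can be treated as parallel forms.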


Phase II: Locating People on the Continuum


Once the items have been scaled, the location of people on the continuum
proceeds as respondents actually rate the target object with respect to the
affective characteristic. In our example using the instructions in table 3-2,
respondents merely indicate which of the 20 attributes their teacher
exhibits. The score for the teacher rated is then the mean or median of the
scale values for those items selected. For example, a respondent checking
items 1, 5, 6, 8, and 12 would actually give the instructor a rating of (3.169
+ 3.377 + 3.887 + 3.324 + 3.707) ÷ 5 = 3.49; whereas if items 7, 16, 18,
and 19 were checked the rating would be (1.989 + 2.164 + 1.502 − 0.035) ÷
4 = 1.41. Clearly, the first rating is higher since the student felt the teacher
exhibited characteristics previously judged to indicate good teaching.
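
The scoring arithmetic can be reproduced with a short Python sketch. The scale
values are those quoted in the example. Pairing each value with an item number
in the order listed is an assumption, since table 3-2 is not reproduced here,
and the function name is hypothetical.

```python
from statistics import mean

# Scale values quoted in the text, keyed by item number (pairing assumed to
# follow the order in which the items and values are listed).
scale_values = {
    1: 3.169, 5: 3.377, 6: 3.887, 8: 3.324, 12: 3.707,
    7: 1.989, 16: 2.164, 18: 1.502, 19: -0.035,
}


def thurstone_score(checked_items, scale_values):
    """Mean scale value of the items the respondent checked (b = 1)."""
    return mean(scale_values[item] for item in checked_items)


first = thurstone_score([1, 5, 6, 8, 12], scale_values)
second = thurstone_score([7, 16, 18, 19], scale_values)
print(f"first rating:  {first:.3f}")
print(f"second rating: {second:.3f}")
```

Items that are not checked have b = 0 and simply drop out, so only the checked
items' weights enter the mean.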

After obtaining the scale values of a target group, one final analysis will 
represent the second criterion for selecting items in the Thurstone tech- 
nique. This procedure, developed by Thurstone and Chave (1929), is called 
the criterion of irrelevance and has been described by Thurstone (see 
Edwards, 1957, pp. 98-101) and Fishbein and Ajzen (1975, p. 70), and 
illustrated by Anderson (1981, pp. 243-248). The procedure is not often 
used but should gain support in that it examines the relationship between 
the judges’ ratings of favorable—unfavorable affect in each item during 
Phase I and the respondents’ scale values obtained after administering the 
items in Phase II. 

The purpose of the analysis is to identify items that yield responses that
appear to represent factors other than the affective characteristic being
measured. By employing the criterion one assumes that items with particular
scale values will be selected by people whose attitudes are located near
the scale value on the evaluative continuum. Fishbein and Ajzen (1975,
p. 70) and Anderson (1981, pp. 243-247) have illustrated the use of item
tracelines or characteristic curves to represent the relationship between the
proportion of people or probability of agreeing with a particular item and
the item scale value. Figure 3-1 contains a modified version of Fishbein
and Ajzen's tracelines for three items with low, median, and high scale
values (i.e., unfavorable, neutral, and favorable items). The horizontal
axis represents possible attitude scores and the vertical axis indicates the
proportion of people selecting the item and obtaining the respective
attitude scores. In practice, the values on the horizontal axis would represent
ranges of attitude scores around the 11 points (e.g., 2.5-3.4). After
generating the traceline for each item the criterion of irrelevance is examined
by considering the peak of the curve in relation to the location of the item on
the evaluative dimension (i.e., the scale value). Items passing the criterion
of irrelevance will exhibit tracelines that peak at the scale score category
containing the item’s scale value. When this happens, we conclude that the item
will most likely be selected by people whose attitudes are near the scale
value on the attitude dimension and the item is retained for the final form of
the instrument.

Figure 3-1. Hypothetical tracelines for items with different scale values
(Adapted from Fishbein and Ajzen, 1975, p. 70).
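
A minimal Python sketch of the criterion of irrelevance is given below. The
respondent data, the bin width of 1.0, and both function names are assumptions
made for illustration. They are not taken from Thurstone and Chave (1929).

```python
from collections import defaultdict


def traceline(attitude_scores, endorsed, bin_width=1.0):
    """Proportion of respondents endorsing the item in each attitude-score bin."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [number endorsing, total]
    for score, chose in zip(attitude_scores, endorsed):
        score_bin = int(score // bin_width)
        counts[score_bin][0] += int(chose)
        counts[score_bin][1] += 1
    return {b: yes / total for b, (yes, total) in sorted(counts.items())}


def passes_irrelevance(trace, item_scale_value, bin_width=1.0):
    """True if the traceline peaks in the bin holding the item's scale value."""
    peak_bin = max(trace, key=trace.get)
    return peak_bin == int(item_scale_value // bin_width)


# Hypothetical data: eight respondents' attitude scores and whether each
# respondent selected the item under study.
scores = [1.2, 1.8, 2.4, 2.9, 3.1, 3.6, 4.2, 4.7]
selected = [0, 0, 0, 1, 1, 1, 0, 0]

trace = traceline(scores, selected)
print(trace)
print(passes_irrelevance(trace, item_scale_value=3.4))
```

With these invented data the item is chosen mostly by respondents scoring
between 3 and 4, so an item calibrated at 3.4 would be retained, while the same
response pattern for an item calibrated at 1.5 would suggest that something
other than the intended characteristic drives its selection.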

Prior to leaving this section, we note that for Thurstone scales we do not
expect a high correlation between the obtained attitude score and the
selection of items. This is the case since the relationships between item
selection and scale values depicted for the three items in figure 3-1 are
curvilinear in nature with the shape of the traceline differing for items with
different scale values. Therefore, in the Thurstone technique, items are not
selected on the basis of the relationship between item endorsement and the
attitude score. In a later section we will note that the opposite is true for
the Likert technique.

Some researchers have shied away from developing Thurstone scales. It
is true that the procedure is time consuming in that different subgroups of
judges can result in different scale values for the items. This variability in
judges’ opinions means the weights are unstable and suspect for future use.
Nunnally (1978) suggests that the more easily constructed summative
(Likert and semantic differential) rating scales tend to be more reliable. It is
recommended that researchers using the Thurstone technique place much
emphasis on the stability of the scale values across subgroups of judges and
select another procedure if stability is not present. Those wishing to review
several Thurstone scales are referred to Shaw and Wright’s (1967) book
Scales for the Measurement of Attitudes.


Latent-Trait Models 


In the previous section Thurstone’s procedure for scaling items and measur- 
ing people was described. A key feature of this technique was the ordering of 
items along a continuum (i.e., calibrating items) of increased affect (e.g., 
attitude) and then measuring a person’s location along this continuum. 


Cognitive Instruments 


During the last 10 years, latent-trait techniques have received considerable
attention in an attempt to achieve this same goal for cognitive achievement 
tests. These techniques have become quite popular for describing the prob- 
ability that a person with a given ability level will pass an item with given 
item parameters. During the phase of item calibration three possible item 
parameters can be estimated from the response data: difficulty, discrimina- 
tion, and chance-success level. A feature and an assumption of the latent 
trait models is that the estimated parameters are sample free—i.e., the 
same regardless of the sample of people employed. In addition, once the 
items have been calibrated, a subset of the items can be used to measure 
a person’s trait level. Thus, the latent-trait model also produces item-free 
trait measurement. A measurement model yielding both sample-free and 
item-free measurement has been crucial in the several statewide achieve- 
ment-proficiency testing programs for selecting test items and creating 
parallel forms for longitudinal comparisons. 


The early work with latent-trait models for dichotomously scored
achievement tests (i.e., correct, incorrect) was carried out by Rasch (1966)
in the 1950s and has been popularized by Wright as described in an excellent
source entitled Best Test Design (Wright and Stone, 1979). The Rasch model
is a one-parameter model which assumes that items differ only in terms of
item difficulty.
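
To make the three item parameters concrete, the following Python sketch
evaluates the logistic form in which these models are usually written. It is
an illustration, not a reproduction of any particular program: the function
and argument names are hypothetical, and the scaling constant that some
presentations include in the exponent is omitted. Setting the discrimination
to 1 and the chance-success level to 0 gives the one-parameter (Rasch) case.

```python
import math


def p_correct(theta, difficulty, discrimination=1.0, chance=0.0):
    """Probability of passing an item under a three-parameter logistic model."""
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return chance + (1.0 - chance) * logistic


def p_rasch(theta, difficulty):
    """One-parameter special case: items differ only in difficulty."""
    return p_correct(theta, difficulty)


# Hypothetical ability levels in logits for an item of difficulty 0.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_rasch(theta, 0.0), 3), round(p_correct(theta, 0.0, 1.5, 0.25), 3))
```

A person whose ability equals the item difficulty has an even chance of
passing a Rasch item, while a nonzero chance-success level raises the lower
bound of the curve.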


Affective Instruments 


While the various models (i.e., one, two, and three parameters) have
received considerable attention from psychometricians in the area of
achievement testing, few are aware that several models are available that are
simple extensions of Rasch's Dichotomous Model (1966) to other response
formats. In this section we will present the work of Wright and Masters
as described in their book Rating Scale Analysis (1982). Specifically, the
rating-scale model described by Andrich (1978) and Masters (1980) will be
presented in the context of scaling (i.e., calibrating) items and measuring
people's affective characteristics with items employing Likert response
formats. The discussion will address four of the areas and questions
addressed by the latent-trait techniques:
1. Item calibration: Where is each item located on the affective
   continuum?
2. Measure people: Where is each person located on the affective
   continuum?
3. Item fit: How well do the items fit the model?
4. Person fit: How well do the people fit the model?
Since the technique is not well understood by many affective instrument
developers and will be frequently used in the near future, the description will
be presented in some detail. Interested readers are referred to papers by
Koch (1983) and Masters and Hyde (1984) which illustrate using the
latent-trait model for Likert scaling.


Calibrating Items for Measuring People 


The Likert response format has been used extensively for affective
instruments. The typical 5-point agree continuum consists of ordered response
alternatives such as:

Strongly Agree    Agree    Undecided    Disagree    Strongly Disagree
      5             4          3            2               1

According to Wright and Masters (1982, p. 48), "completing the kth step"
represents the selection of the kth step over the (k - 1)th step on the
response continuum. (Note that a 5-point response format has four steps such
as 1-2 = 1 step, 2-3 = 1 step, etc.) Given this format, a person selecting the
“agree” option has chosen “disagree” over “strongly disagree,” “undecided”
over “disagree,” and “agree” over “undecided,” but has not chosen “strongly
agree” over “agree.” These ordered steps in the response format represent the
relative difficulties in responding to the item and are assumed to be constant
across all of the items.

Given this assumption, the Rating Scale Analysis model is run on a
computer program called CREDIT which employs one of several (e.g., PROX,
PAIR, CON, UCON) procedures for estimating the item parameters (i.e., scaling
the items) and the person parameters (i.e., measuring people). If the UCON
procedure is employed, the person and item parameters are estimated
simultaneously. To be more specific, the procedure estimates a position for
each person on the affective-variable continuum being measured (i.e., person
parameter), a scale value for each item (i.e., item parameter), and m (e.g.,
4) response thresholds for the m + 1 (e.g., 5 for a five-point Likert
agreement scale) response categories employed.
Figure 3-2 contains these values for the item characteristic curves, which
are called ogives, for a hypothetical attitude item. For ease of illustration
only the "strongly disagree" and "disagree" curves are included for the
5-point scale. (Note that for the Rasch Dichotomous Model employed in
achievement tests only one such curve exists per item.) The purpose of this
curve is to estimate the probability of person n responding “strongly
disagree” (i.e., 1) rather than "disagree" (i.e., 2) on this item as a
function of the person's affective characteristic parameter A and an item
scale value parameter SV, which dictates the transition from responding
“strongly disagree” to responding "disagree" on the item (see Wright and
Masters 1982, pp. 55, 128, 130). The values marked off on the horizontal axis
represent units called logits from the logistic scale, which varies from -3 to
+3. (The logistic scale can be converted to the normal curve using a
conversion factor.) High logit values are associated with items with which it
is easy to agree (i.e., low difficulty and a high attitude score) and low
values with items with which it is easy to disagree (i.e., high difficulty and
low attitude score). Once the curves for an item have been centered on their
scale value, we can estimate the probability of any person selecting any one
of the response alternatives. To do this, we locate the person’s attitude
estimate in logits (provided in the computer output) on the horizontal axis
and project a vertical line to the height of one of the curves, then read off
the probability value from the vertical axis. For example, in figure 3-2, a
person with an attitude estimate of -.1 logits has a .35 probability of
disagreeing and a .65 probability of strongly disagreeing with this item.
Further, the average scale value in logits for a set of items will always
equal zero. After calculating the standard deviations (S) of the scale values
in logits, we can determine the probabilities of selecting each response
option for people with attitude raw score levels of X + S, X + 2S, X - S, and
X - 2S.

Figure 3-2. Strongly disagree and disagree category probability curves for a
hypothetical item. (Vertical axis: probability; horizontal axis: attitude
person parameter A in logits, from -3 to +3.)
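
The way category probabilities follow from a person parameter, an item scale
value, and a common set of thresholds can be sketched in Python. This is a
minimal illustration of the rating-scale model in the form in which it is
generally written; it does not reproduce the CREDIT program or the UCON
estimation procedure. The function name, the threshold values, and the
convention that category 0 is the lowest response are assumptions made for
the example.

```python
import math


def category_probabilities(person, item, thresholds):
    """Probabilities of the m + 1 ordered categories for one person-item pair."""
    # Category 0 has an empty sum; category x accumulates the first x steps.
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (person - item - tau))
    top = max(cumulative)  # subtract the maximum for numerical stability
    weights = [math.exp(c - top) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical thresholds for a 5-point format (m = 4 steps), in logits.
thresholds = [-1.5, -0.5, 0.5, 1.5]
for person in (-2.0, 0.0, 2.0):
    probs = category_probabilities(person, item=0.0, thresholds=thresholds)
    print(person, [round(p, 2) for p in probs])
```

With a single threshold the function reduces to the dichotomous Rasch case,
which is one way to check that the extension to several ordered categories
has been coded consistently.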


Item Fit 


Earlier we noted that the rating-scale model attempts to locate the items on
the affective continuum. On the basis of people's responses to the items,
scale values were generated for this purpose. A feature of the procedure is a
test of how well each item fits the rating-scale model. This t-test statistic
reflects the extent to which the responses support the existence of a
continuum underlying the affective characteristic. Items not fitting the model
can then be discarded from the item set.
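
The idea behind an item-fit statistic can be illustrated with a short Python
sketch. The code below computes an unweighted mean square of standardized
residuals for one item, a quantity that Rasch-family analyses commonly
convert to a standardized, t-like statistic; that conversion is omitted here.
This is an assumption-laden sketch rather than the CREDIT program's
computation: person estimates, the item scale value, and the thresholds are
treated as already known, responses are coded 0 to m, and all names and
values are hypothetical.

```python
import math


def category_probabilities(person, item, thresholds):
    """Rating-scale model probabilities for categories 0..m."""
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (person - item - tau))
    top = max(cumulative)
    weights = [math.exp(c - top) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def fit_mean_square(responses, persons, item, thresholds):
    """Average squared standardized residual for one item across people."""
    squared = []
    for x, person in zip(responses, persons):
        probs = category_probabilities(person, item, thresholds)
        expected = sum(k * p for k, p in enumerate(probs))
        variance = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
        squared.append((x - expected) ** 2 / variance)
    return sum(squared) / len(squared)


persons = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical person estimates
thresholds = [-1.5, -0.5, 0.5, 1.5]     # hypothetical step values
consistent = [0, 1, 2, 3, 4]            # responses that follow the continuum
reversed_pattern = [4, 3, 2, 1, 0]      # responses that contradict it

print(fit_mean_square(consistent, persons, 0.0, thresholds))
print(fit_mean_square(reversed_pattern, persons, 0.0, thresholds))
```

An item whose responses run against the ordering of the people on the
continuum produces a much larger value than an item whose responses follow
it, which is the pattern that leads to an item being discarded.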


Person Fit 


The CREDIT computer program also provides useful information regarding
the extent to which people fit the hypothesized model. People whose
responses are inconsistent with the statement difficulty ordering can be
easily identified for follow-up study of their response frequencies as well as
other cognitive and affective characteristics.
Wright and Masters (1982) also illustrate how the analysis of person fit can
be used to examine the response styles, or sets, which result when not all
people use the response options in the same way. That is, some people may
tend to select extreme responses, “agree” responses, or the "true" option
for various rating scales. The common practice is to include both positive
and negative item stems to help control for the possibility of a response style.
Wright and Masters (1982, ch. 6) provide an interesting illustration of how
the rating-scale model analysis of “person fit" can be used to ascertain if the
positive and negative items really measure the same affective characteristics.
Ten positive and 10 negative attitude toward drugs items were compared for
75 students. The person-fit analysis indicated that the positive and negative
item stems did not provide consistent information regarding a person’s
attitude and should not be combined into a 20-item instrument. Since the goal
was to locate each person on the affective continuum, differences in
response styles would present a problem for the combined 20 items. Whereas
most developers tend to routinely combine the items, the rating-scale model
and UCON procedure provide the necessary empirical support for using a
combination of positive and negative items.

Regarding Fishbein’s expectancy-value model (Fishbein and Ajzen 
1975), the evaluation (e) of the favorable—unfavorableness of the items is 
initially examined by the developer and then empirically determined as the 
items are placed (calibrated) on the attitude continuum on the basis of the 
response data. The belief values (b) initially take on the values of the
response scale employed and are then used to locate (i.e., measure) people on
the affective characteristic using the latent-trait model. 

In summary, this section has described some of the features of using 
latent-trait techniques for developing and analyzing affective instruments. 
The purposes of the techniques were to locate items on a continuum of an 
affective characteristic and then measure people on the characteristic. The 
Rating Scale Analysis measurement model was identified for use in conjunc- 
tion with the UCON procedure for calibrating items and measuring people 
as well as for examining item and person fit to the model. 

Readers are encouraged to read Wright and Masters' (1982) book, Rating
Scale Analysis, for a detailed discussion of the available techniques. Their
description of the analysis of the attitude toward drugs items (ch. 6) is
particularly recommended since it illustrates the technique in the context
of data tabled from the CREDIT computer program. The Rating Scale
Analysis book ($24) can be obtained from the MESA Press, University
of Chicago, Department of Education, 5835 South Kimbark Avenue,
Chicago, IL 60637.


Likert's Summated Rating Techniques 


Likert’s (1932) method of summated ratings has been appropriately popular.
According to Nunnally (1978), the Likert scales have been frequently
used because they are relatively easy to construct, can be highly reliable, and
have been successfully adapted to measure many types of affective
characteristics. Instruments employing Likert's technique contain a set of
statements (i.e., items) presented on what has been called a “Likert response
format.” The 5-point strongly agree-strongly disagree format is commonly
employed. Responses are then summed across the items to generate a score
on the affective instrument. Examples of Likert scales discussed in earlier
chapters include the Gable-Roberts Attitude Toward Teacher Scale (Gable
and Roberts, 1982) and the Work Values Inventory (Super, 1970).

With respect to Fishbein's expectancy-value model (see equation 3-1),
we note that the evaluation (e) of the favorableness-unfavorableness (i.e.,
positive-negative) of the statement or attribute with respect to the target
object is initially determined by the instrument developer and not by an
independent set of judges as was the case for the Thurstone procedure.
While a group of judges could later examine the items, they do not assist in
scaling the statements on the evaluative dimension, as was the case in the
Thurstone technique. Thus, the values of e in equation 3-1 are set at -1 and
+1 for negative and positive item stems; the value of b takes on the values
of the selected response format. For example, a 5-point strongly agree-
strongly disagree format would yield values from 1 to 5 for b. In responding
to the items, people are actually locating themselves on the underlying
affective continuum through their intensity and direction ratings. People are
thus scaled on the items by summing their item level responses across the
items defining the characteristic. (Note that negative item stems should first
be reverse scored so that the scoring process takes place in the same
direction for all items.)
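
The scoring rule just described (reverse score the negative stems, then sum)
can be written out as a short Python sketch. The item labels, the responses,
and the function names are hypothetical and serve only to illustrate the
arithmetic for a 5-point format.

```python
def reverse_score(response, n_points=5):
    """Reverse a response on a 1..n_points format (e.g., 5 becomes 1)."""
    return n_points + 1 - response


def summated_score(responses, negative_items, n_points=5):
    """Sum item responses after reverse scoring the negatively worded stems."""
    total = 0
    for item, response in responses.items():
        if item in negative_items:
            response = reverse_score(response, n_points)
        total += response
    return total


# Hypothetical respondent: q2 and q4 are negatively worded stems.
responses = {"q1": 5, "q2": 1, "q3": 4, "q4": 2}
negative = {"q2", "q4"}
print(summated_score(responses, negative))
```

After reverse scoring, a high total always indicates a favorable position on
the continuum, regardless of how many negative stems the instrument contains.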


Item Selection 


The development of a Likert scale begins with the selection of a large
number of items (i.e., belief statements that represent operational
definitions of the affective characteristic). Approximately 10-12 items should
be initially written for each judgmentally derived affective category specified
during the content validity stage. After the items are reviewed during the
content validity phase, a pilot study should be conducted where a
representative sample of people (6-10 times as many people as there are items)
respond to the pilot form. Item analysis, alpha reliability, and factor
analysis procedures can then be carried out as described in chapters 4 and 5.

The criterion for item selection used by Likert was the criterion of internal
consistency, which results in eliminating items that do not relate to the
affective characteristic being measured. Theoretically, this criterion
specifies that people with higher overall attitude scores should tend to agree
with favorable or positive items and disagree with unfavorable or negative
items.¹ Fishbein and Ajzen (1975) illustrate the resulting tracelines for two
items meeting the criterion of internal consistency. A modified version of
their illustration is presented in figure 3-3. Consider a 10-item attitude scale
that employs a 5-point agreement response format. The possible range of
scores from 10-50 is indicated on each baseline in figure 3-3. The vertical
axis represents the probability of endorsing the item or the proportion of
people agreeing with the item at various attitude score intervals.
Operationally, this reduces to the correlation of the item score with the total
attitude score. Readers will note, contrary to the Thurstone technique, that
the tracelines for the favorable and unfavorable items are linear and the
correlation between item endorsement and the attitude score is the criterion
used for item selection.

Figure 3-3. Hypothetical tracelines for two Likert items (Adapted from Fishbein
and Ajzen, 1975, p. 72).
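
The internal-consistency criterion can be illustrated with a small Python
sketch that correlates each item with the total score. The data are
hypothetical, negative stems are assumed to have been reverse scored already,
and the optional "corrected" version (which removes the item from the total
before correlating) is a common refinement included here as an assumption,
not something prescribed in the text.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def item_total_correlations(data, corrected=False):
    """Correlate each item with the total score (rows = people, columns = items)."""
    totals = [sum(row) for row in data]
    results = []
    for j in range(len(data[0])):
        item = [row[j] for row in data]
        criterion = [t - i for t, i in zip(totals, item)] if corrected else totals
        results.append(pearson(item, criterion))
    return results


# Hypothetical pilot data: six people, four items on a 5-point format.
data = [
    [5, 4, 5, 2],
    [4, 4, 4, 3],
    [3, 3, 4, 1],
    [2, 3, 2, 4],
    [2, 1, 2, 3],
    [1, 2, 1, 3],
]
print([round(r, 2) for r in item_total_correlations(data)])
```

In this invented data set the fourth item does not rise with the total score,
so under Likert's criterion it would be a candidate for elimination.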


Response Formats


The response format for the items should be consistent with the intent of the
instrument. Since several different formats are available, one has to select
the format that best provides the information desired so that the instructions
for respondents can be written to be consistent with the response format.
For example, do you want to know if respondents agree with the items, or
how frequently they have experienced the event described by the item? To
assist in making this decision, table 3-3 presents several popular response
formats for the following intensity areas: agreement, frequency, importance,
quality, and likelihood. There are certainly several variations on these
response formats, but the ones listed appear to be the most popular,
especially the formats listed in bold italic type.
If you decide to use a less-popular format or develop your own format, a
pilot study may be needed to validate the ordered nature of the response
options. Since the scaling of people will involve assigning ordered numbers
to the options, there must be no question that the rank order of the options is
correct and that the intervals between the options can be assumed to be
equal. If there is any doubt, have a group of about 15 judges independently
rank the response options and discuss the intervals between the options.
Confusion at this point will severely restrict the later assessment of reliability
and validity of the instrument as well as the choice of parametric versus
nonparametric statistics to be used in the analysis of data.


Number of Steps in the Response Format 


Several opinions have been expressed regarding the number of steps to use
in the Likert response format. As illustrated in table 3-3, several different
formats have been employed by researchers. The decision is important in that
using too few steps will result in failing to elicit the fine discriminations
of which the respondent is capable, while too many steps could create
confusion and frustration. Thus, the number of steps to be used may differ
across instruments and should be chosen on the basis of both practical and
empirical considerations.
From a practical viewpoint, a greater number of steps in the scale will
necessitate a higher level of thought for making fine discriminations between
the scale anchor points. If respondents become annoyed or generally
confused by the large number of gradations used, they could become careless
and provide you with unreliable data. It is, therefore, important to consider
the context of the items as well as the training, age, educational level,
motivation, and cognitive level of the respondents when selecting the
number of steps to be used.

From an empirical viewpoint, several researchers have examined the 
issue of the optimal number of rating scale steps. Whereas Cronbach (1950) 
cautioned that the number of steps issue was also a validity issue, most 
researchers have focused on the reliability of the data obtained. As Nunnally 
(1978) and Guilford (1954) have noted, the reliability of the scale should 
increase as a function of the number of steps employed. This view has been 
supported in studies reported by Garner (1960), Finn (1972), and Komorita and
Graham (1965). According to Nunnally, increasing the number of steps
from 2 to 20 generally increases reliability rapidly at first, levels off around 7, 
and increases little after about 11 steps. These findings are consistent with 
research reported by McKelvie (1978) and Jenkins and Taber (1977). 
McKelvie found that 5-point or 6-point scales were most reliable. A larger 
number of categories appeared to have no psychometric advantage and 
fewer than five categories could result in a lack of response discrimination. 
On the other hand, Komorita (1963) found that the reliability of a dichoto- 
mous format was not significantly less than the reliability of multistep scales. 
Similarly, Matell and Jacoby (1971) developed 18 different Likert scale 
formats (2 points to 19 points) for use with a 60-item modified version of the 
Allport-Vernon-Lindzey Study of Values (1960). Both internal consistency 
and stability reliabilities were found to be independent of the number of 
steps employed in the rating format. These findings were consistent with 
those reported by Komorita and Graham (1965). Likewise, concurrent 
validity information obtained through additional self-ratings of each Study 
of Values domain category indicated that validity, as defined in the study,
was independent of the number of scale points. In another study, Comrey 
and Montag (1982) compared the factor analysis (construct validity) results 
of 2-point and 7-point response formats used on the Comrey Personality 
Scales. While the factor structure was found to be similar for the two for- 
mats, higher factor loadings reflecting higher intercorrelations among the 
variables were found for the 7-point scale. Comrey also reports that other 
researchers have reported differences in factor structures when dichotomous
and multistep formats have been employed for the Rotter I-E Scale
(Joe and John, 1973), the Personal Orientation Inventory (Velicer, Di- 
Clemente, and Corriveau, 1979), and the Eysenck Personality Questionnaire 
(Velicer and Stevenson, 1978). On the basis of these studies Comrey (1982) 
concluded that the 7-point format for personality inventories allows for 
distinctions by respondents and for more precise measures of the underlying 
factor structure. 


It appears that, while some researchers might disagree, there is little
to recommend the use of 2-point response formats. Using such formats
makes responding and scoring easier, but could result in a loss
of meaningful information. For example, although Coopersmith’s (1967)
popular Self-Esteem Inventory has always employed the 2-point (Like Me-Unlike
Me) response format, several studies have reported low scale alpha
reliabilities and factor structures that fail to support Coopersmith’s
suggested scoring scheme (see, for example, Glover and Archambault, 1982).
On the basis of the research reported, the reliability and validity issues seem
to be best served through the use of from five to seven response categories.

A related issue is the use of an odd or even number of steps in the
response format. An odd number results in an “undecided” or “neutral”
category which allows the respondent to not commit to either a positive or
negative direction. Proponents of its use suggest that no significant
differences are found in scale scores for the same respondents using both types
of scales and that the selection of a neutral rating is no more ambiguous in
meaning than the selection of any of the other categories (DuBois and
Burns, 1975; Ory and Wise, 1981). On the other hand, Doyle (1975) notes
that not using a neutral point forces more thought by the respondent and
possibly more precise discriminations. Consistent with this view, Ory and
Wise found significant pre-post attitude treatment differences for
respondents to a 4-point scale, while no differences were found for
another group randomly assigned to respond to a 5-point format.

There is no definitive answer to the question of using a neutral point in
the response format. If you are concerned that respondents may not be
responding to the middle category to represent a neutral attitude, you can
follow the technique described by DuBois and Burns (1975) to examine this
issue. Essentially, the procedure involves the plotting (vertical axis) of the
respondents’ mean scale scores for each separate response category (horizontal
axis). If the middle category is a true neutral response, the plot of means
should be in a relatively straight line and the standard deviations
(dispersion) of the scores around the means at each response category should be
similar. While some researchers may wish to examine this issue empirically
during the pilot stages of a new instrument, Nunnally (1978) does not
consider this issue to be of great importance and concludes that it may best be
decided on a situational basis in the context of the items and respondents.
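
The tabulation behind such a plot can be sketched in Python. The code below
is a simplified illustration, not the published procedure itself: it computes
the mean and standard deviation of scale scores for each response category of
one item, and then measures how far the neutral-category mean lies from the
midpoint of its two neighbouring means as a rough stand-in for inspecting the
straightness of the plotted line. The data, the function names, and the
midpoint check are assumptions made for the example.

```python
from statistics import mean, pstdev


def category_profile(item_responses, scale_scores):
    """Mean and standard deviation of scale scores for each response category."""
    groups = {}
    for response, score in zip(item_responses, scale_scores):
        groups.setdefault(response, []).append(score)
    return {cat: (mean(s), pstdev(s)) for cat, s in sorted(groups.items())}


def neutral_gap(profile, neutral=3):
    """Distance of the neutral-category mean from the midpoint of its neighbours."""
    midpoint = (profile[neutral - 1][0] + profile[neutral + 1][0]) / 2
    return profile[neutral][0] - midpoint


# Hypothetical responses to one 5-point item and the matching scale scores.
item_responses = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
scale_scores = [12, 14, 20, 22, 29, 31, 38, 40, 46, 48]

profile = category_profile(item_responses, scale_scores)
for category, (m, s) in profile.items():
    print(category, m, s)
print(neutral_gap(profile))
```

A gap near zero together with similar standard deviations across categories
is what one would expect if the middle category is functioning as a true
neutral point.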


Response Format Definition 


After selecting the number of steps to use in the response format, the
labeling of the response categories becomes an issue. In the previous section,
table 3-3 presented several examples of all-category defined formats. Dixon,
Bobo, and Stevick (1984) have reviewed several studies where different 
techniques were used to label the response categories. Techniques consisted 
of various combinations such as labeling all categories versus labeling only 
categories at the end of the response continuum; verbal versus numerical 
labels; and vertical versus horizontal labels. The influence of the formats on 
the results of the study was mixed; in some studies differences were noted 
and in other studies no differences were present. For data gathered by Dixon 
et al. (1984), no differences in perceptions of locus of control were found 
between the end-defined and all-category defined versions of a control scale 
for college students. The end-defined format was associated with higher 
standard deviations than the all-category defined format. 

The labeling of the response format categories appears to be an open 
question from an empirical point of view. It seems important to consider the 
age and educational level of the respondents in the context of the cognitive 
complexity of the rating task. Pilot studies of different formats are always 
good insurance during the process of instrument development. 


Positive and Negative Item Stems 


In chapter 2 we discussed the development of operational definitions for the 
affective characteristics that were the item stems to be included in the instru- 
ment. Prior to and after the content validity study it is important to consider 
the issue of positive and negative item stems. Will you use all positively 
stated items, negative items, or a set of mixed items? Unfortunately, the 
research in this area is not definitive but does offer some guidance. The 
latent-trait models described in an earlier section will be the focus of much 
future work in this area. 

The research focuses on the topic of response sets, or response styles,
which Cronbach defined as “any tendency causing a person consistently to
make different responses to test items than he would have had the same
content been presented in a different form” (Cronbach, 1946, p. 476). For
example, individuals may tend to select the “neutral” category on a 5-point
"agree-disagree" response continuum, “true” on a “true-false” continuum,
or simply agree with most statements. Cronbach (1950) states that
such response sets are possible on affective instruments that contain
ambiguously stated items, employ a disagree-agree response continuum, and
require responses in a favorable or unfavorable direction.

Efforts at identifying response sets have been mixed, though, since a true
response set must represent a reliable source of variance in individual
differences, be an artifact of the measurement procedure, and be partially
independent of the affective characteristic measured (Nunnally, 1978). While it
has proved difficult to isolate response sets that consistently meet these
criteria, Rorer’s (1965) review of the response set literature supports the
view that item content relates to a response set that represents a reliable
source of response variance. Researchers have spent much time studying
such areas as social desirability (i.e., giving socially accepted responses) and
acquiescence (i.e., tendency to agree). The forced-choice item format has
been suggested to assist in dealing with social desirability. In a later section
on ipsative versus normative scores we will note that the forced-choice
format may only partially address the tendency to give socially accepted
responses.

The college student and adult-based research on the response set labeled
"acquiescence" or "agreement tendency" has been reviewed by Nunnally
(1978), who feels that it may not be an important response set given his three
criteria for the definition of a response set. He does conclude, however, that
the issue can be mostly eliminated by constructing the instrument to have a
balanced number of "positive" and "negative" items. Researchers have
tended to accept this advice and routinely proceed with data gathering. In
a study reported by Ory (1982), the use of positive or negative items for
rating faculty instruction had no effect on the student ratings. Studies
reported by Benson and Hocevar (1985), Benson and Wilcox (1981), Campbell and
Grissom (1979), Schriesheim and Hill (1981), and Wright and Masters
(1982) provide us with appropriate caution and suggest that the instrument
developers should pilot different versions of an instrument to ascertain if the
ratio of positive and negative item stems is associated with different
reliability and validity information. Recall that in the section on latent-trait
models, Wright and Masters (1982) found that the positive and negative attitude
toward drug items did not measure the same construct. In the Benson and
Wilcox (1981) study, all positive, all negative, and balanced instruments
were randomly administered to 622 grade 4-6 students to measure attitude
toward a court-ordered integration plan. For this age level, the mixed form
was associated with lower alpha reliabilities at all grade levels (total group:
mixed = 0.65; other forms = 0.78) suggesting that the grade 4-6 students
responded more inconsistently to the mixed form. Also, it was found that
younger (grade 4) students marked significantly more positive responses for
the positive stems than for the other forms and more negative responses for
the negative form. On the other hand, the older students (grade 6) responded
equally across all three forms.

The later study by Benson and Hocevar (1985) furthered the analysis of
these data by examining the factor structure (construct validity) of the three
forms; different factor structures were found. Combining the results of both
studies led Benson and Hocevar to recommend that attitude instruments for
elementary school students should use all positively stated items.
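
The alpha reliabilities compared across forms in the preceding paragraphs can
be computed with the standard coefficient alpha formula, sketched here in
Python. The data are invented and are not the Benson and Wilcox data; the
function name is hypothetical. The second data set simulates a single item
that respondents answer in the opposite direction from the rest, as might
happen when a negative stem is answered inconsistently.

```python
from statistics import pvariance


def cronbach_alpha(data):
    """Coefficient alpha for rows of item scores (rows = people, columns = items)."""
    k = len(data[0])
    item_variances = [pvariance([row[j] for row in data]) for j in range(k)]
    total_variance = pvariance([sum(row) for row in data])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses from six people to three items on a 5-point format.
consistent = [[5, 4, 5], [4, 4, 4], [3, 3, 4], [2, 3, 2], [2, 1, 2], [1, 2, 1]]
# Same people, but the third item is answered in the opposite direction.
inconsistent = [[5, 4, 1], [4, 4, 2], [3, 3, 2], [2, 3, 4], [2, 1, 4], [1, 2, 5]]

print(round(cronbach_alpha(consistent), 2))
print(round(cronbach_alpha(inconsistent), 2))
```

Comparing alpha across pilot versions in this way is one piece of the
evidence a developer can use when deciding on the ratio of positive and
negative item stems.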

Related to this issue of positive and negative item stems is the tendency to
respond in a similar manner to adjacent (i.e., prior) items; this has been
called a proximity effect. Schurr and Henrikson (1983) discovered a
proximity effect for a low-inference classroom observation scale for one sample
of respondents. Reece and Owen (1985) extended the work of Schurr and
Henrikson by examining the existence [...] different low- and high-inference
[...] proximate effects across the instruments [...] needed to examine the
circumstances that produce large proximate effects.
It should be clear at this point that the existence of a large proximity effect
would most likely result in contaminated factors during a study of construct
validity and in inflated estimates of alpha internal consistency [...] dealt
with the acquiescence response set.


Semantic-Differential Scales 


The semantic differential is a technique that scales people on a set of items called scales, which are anchored or bounded on each end by bipolar adjectives. The rated target is called a concept and appears at the top of the set of scales. A sample concept and scale is presented in Appendix A.

The development of the semantic differential technique is credited to Charles Osgood (1952, 1962) and is described in the book entitled The Measurement of Meaning (Osgood, Suci, and Tannenbaum, 1957). As the title implies, Osgood's research focused on the scientific study of language and the meaning of words. The assumption was that much of our communication takes place through adjectives. That is, we commonly describe teachers as good or bad, fair or unfair, hard or easy; athletes as fast or slow, strong or weak; and school subjects as useful or useless, valuable or worthless. Theoretically, the semantic differential scales bounded by these bipolar adjectives can be represented as a straight line or geometric semantic space. The scales pass through the origin of the semantic space and, as several scales, they form a multidimensional geometric space. When individuals rate a concept on a scale, they are effectively differentiating the meaning of the concept. That is, they are expressing the intensity and direction of affect they feel is associated with the bipolar adjective scale in relation to the targeted concept.

Table 3-4. Typical Semantic Differential Bipolar Adjective Pairs

Evaluative              Potency             Activity
[illegible]             large-small         fast-slow
[illegible]             strong-weak         active-passive
pleasant-unpleasant     rugged-delicate     excitable-calm
positive-negative       heavy-light         busy-lazy
sweet-sour              thick-thin          quick-slow
valuable-worthless                          hot-cold
kind-cruel
happy-sad
nice-awful
honest-dishonest
fair-unfair
In Osgood's (Osgood et al., 1957) original work 20 different concepts (i.e., target objects) were rated (i.e., differentiated) by 100 people using 50 sets of bipolar adjectives. After collapsing the data across people and concepts, a 50 x 50 matrix of intercorrelations of scales was generated so that a factor analysis could be performed. The purpose of this analysis was to identify the minimum number of orthogonal dimensions necessary to provide a parsimonious description of the relationships among the scales. In other words, the aim was to explore the common meanings of the adjectives across the 20 different concepts (i.e., the measurement of meaning). As a result of these studies, Osgood identified several dimensions of the semantic space. Three consistently identified dimensions were: evaluative, potency, and activity. Examples of the bipolar adjective pairs defining the dimensions are listed in table 3-4.

Since Osgood's early work in 1957 several researchers have used the semantic differential technique to scale people with respect to affective characteristics. In the fields of psychology and education most researchers have concentrated on the evaluative dimension as a measure of attitude toward the stated concept. Interested readers are referred to an excellent volume by Snider and Osgood (1969) which contains discussions of the theory and development, as well as illustrations of several applications of semantic differentials.

Analogous to the Likert technique, equation 3-1 of Fishbein's


journal: 
ales. 


veral subdime 
Stepisto Select ab 


the use ofa 7-step scale whereas a S-step Scale m 
elementary students (Osgood et al., 1957, p. 85) үзе 
including examples аге then Written 


rm of the instrument is 

veloped by Pappalardo 
€ as Guidance Counselor" is presented 
65, several of which аге 


ney à Bes 
scales. Y апа activity marker 
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Analysis of the Semantic Differential 


After the semantic differential has been constructed it is necessary to pilot it on a representative sample of about 6-10 times the number of scales used. The pilot group should be very similar to the group for which the future research is targeted. Clear instructions and clear understanding of the process are necessary to achieve data of good quality.

Using the pilot data, the next step is to conduct a factor analysis as well as item and reliability analyses.* The factor analysis (see chapter 4) will identify the dimensions measured within the set of scales; the item analysis and reliability analysis (see chapter 5) will further assist in determining which items best relate to the identified dimensions as well as the alpha reliability of the dimensions. If the sample size is smaller than desired, you may wish to run the item analysis first to weed out items with high or low means and low variance prior to the factor and reliability analyses. These techniques are used in Pappalardo's (1971) study, where 25 scales were factored for 61 educational counselors and two evaluative dimensions were derived. Factor I (alpha reliability = .94) was called "Counseling Role" as it was defined by such scales as valuable-worthless, sharp-dull, and good-bad; Factor II (alpha reliability = .90) was defined by such scales as insensitive-sensitive, unpleasant-pleasant, and unfair-fair and was called "Facilitative Role." In another study, reported by Gulo (1975), 676 university students rated the concept "Professor" on about 50 bipolar adjectives. Subsequent factor analysis generated eight dimensions, some of which were as follows: Teaching Dynamism (interesting, colorful, progressive, and active), Acceptance (positive, approving, optimistic, sensitive, and motivated), and Intellectual Approach (objective, aggressive, and direct). The point of these examples is that if only a few scales are included, and they are not homogeneous in meaning, it is possible that the resulting factor structure would produce factors defined by too few scales to generate adequate reliability levels. Thus, it is important to carefully construct the set of bipolar adjective scales so that the clusters of homogeneous scales result in the desired dimensions in the factor analysis.

The semantic-differential technique employs the same criterion of internal consistency as the Likert technique for item selection. To meet the criterion, a scale (i.e., item) must correlate well with the attitude score. The resulting item characteristic curves or tracelines are linear in nature as illustrated by the Likert technique tracelines presented earlier in figure 3-3.
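The criterion of internal consistency can be illustrated with a short computation. The following is only a sketch under stated assumptions: the ratings are invented for illustration (six hypothetical respondents rating four bipolar scales from 1 to 7), the function names are our own, and the corrected item-total correlation (each scale correlated with the sum of the remaining scales) is one common way to operationalize "correlates well with the attitude score." Alpha is computed with the usual formula based on item variances and total score variance.

```python
import numpy as np


def cronbach_alpha(items):
    """Alpha internal consistency for a respondents-by-items score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)


def corrected_item_total(items):
    """Correlate each item with the sum of the remaining items."""
    items = np.asarray(items, dtype=float)
    total = items.sum(axis=1)
    return np.array([
        np.corrcoef(items[:, j], total - items[:, j])[0, 1]
        for j in range(items.shape[1])
    ])


# Hypothetical ratings: 6 respondents x 4 bipolar scales scored 1-7.
# The fourth scale is deliberately inconsistent with the other three.
ratings = np.array([
    [7, 6, 7, 2],
    [6, 6, 5, 5],
    [5, 4, 5, 3],
    [3, 3, 2, 6],
    [2, 1, 2, 4],
    [4, 5, 4, 1],
])

alpha_all = cronbach_alpha(ratings)              # all four scales
item_total = corrected_item_total(ratings)       # one value per scale
alpha_trimmed = cronbach_alpha(ratings[:, :3])   # weak scale removed

print("alpha (4 scales):", round(alpha_all, 2))
print("item-total r:", np.round(item_total, 2))
print("alpha (3 scales):", round(alpha_trimmed, 2))
```

In this invented data set the fourth scale correlates negatively with the rest, and dropping it raises alpha, which is the kind of pattern the item analysis is meant to reveal.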
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Scoring 


After the pilot analyses, the semantic differential will be ready for the research sample. At this point, it is essential to consider how to score the form, as this area has caused much confusion and misuse by users of the semantic differential. Lynch (1973) suggested that semantic differentials may be scored by mean scores on each scale, mean scores on each dimension, and the D statistic. The mean of each scale technique has been used to compare two concepts (How I See Myself versus How My Teachers See Me). The D statistic is calculated as

$$D = \sqrt{\sum d^{2}}$$

where D is the distance between the profiles of the two concepts, $d^{2}$ represents the squared difference in the rating of the two concepts on a bipolar adjective scale, and the summation indicates that the squared differences are summed across the scales used to rate the two concepts. D is typically used as a measure of profile similarity (i.e., small values mean close profiles). Lynch (1973) mentions that this technique has been used to form dependent variables such as identification (Myself versus My Teachers), idealization (Myself versus How I'd Like to Be), and empathy (How I Feel about Handicapped People versus How I Feel about Nonhandicapped People).
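The D statistic discussed above is, in Osgood's formulation, the square root of the sum of squared differences between two concept profiles rated on the same scales. The sketch below assumes hypothetical ratings on five 7-point scales; the concept names and values are invented for illustration and are not taken from Lynch (1973).

```python
import math


def d_statistic(profile_a, profile_b):
    """Profile distance: square root of the summed squared rating differences."""
    if len(profile_a) != len(profile_b):
        raise ValueError("both concepts must be rated on the same scales")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


# Hypothetical ratings of three concepts on the same five bipolar scales.
myself = [6, 5, 7, 4, 6]
how_id_like_to_be = [7, 6, 7, 6, 7]
my_teachers = [3, 2, 4, 5, 2]

idealization = d_statistic(myself, how_id_like_to_be)
identification = d_statistic(myself, my_teachers)

print("idealization D:", round(idealization, 2))
print("identification D:", round(identification, 2))
```

Small values indicate similar profiles, so in this invented example the respondent's self-ratings are closer to the ideal self than to the ratings of teachers.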
In summary, the semantic differential is a valuable technique for scaling people on the evaluative dimension of meaning since the ratings can be utilized as a generalized index of attitudes toward concept targets. It is essential, however, that the scales be carefully selected to represent the desired evaluative dimensions and that a sensible procedure be used to score the resulting data. Many researchers are confused by the results of a semantic differential and dismiss the procedure. The problem is not the procedure but the development and understanding of its application.


Fishbein's Expectancy-Value Model: An Illustration


In this chapter we have described how the expectancy-value model provides the framework for standard attitude scaling techniques. An instrument developed by Norton (1984) provides an interesting illustration of how the model can be operationalized using a modified version of Likert's technique. The Sports Plus Attitude Scale (SPAS) was designed to measure the attitudes toward physical education of grade 5-8 students. The first step in developing the SPAS involved identifying the attributes relevant for student attitudes toward sports. A review of literature, as well as an open-ended questionnaire that asked students about their likes, dislikes, and beliefs with respect to physical education, provided the input for developing the statements.

A pilot study was then conducted where 129 grade 5-8 students first evaluated each attribute (e) on a 7-point bipolar (i.e., good-bad) evaluative dimension which was bounded by the adjectives "good" (7) and "bad" (1) and included the descriptors "rather good" (6), "slightly good" (5), "don't know" (4), "slightly bad" (3), and "rather bad" (2). The next step involved obtaining measures of belief strength (b) which represented the probability that the target object (i.e., physical education) had the stated attribute. To obtain the belief probabilities Norton (1984) developed another rating form which contained modified versions of the statements used for the initial evaluations of the attributes. These statements were rated on a 7-point scale which ranged from "agree" (7) to "disagree" (1) and included the descriptors "mostly agree" (6), "slightly agree" (5), "don't know" (4), "slightly disagree" (3), and "mostly disagree" (2). Associated with each of the 7 points on the belief scale was a probability that the target object had the attribute. For the 7-point agree-disagree scale the probabilities were as follows: 1.00, .83, .67, .50, .33, .16, and 0. Table 3-5 contains the five statements used for the Physical Education scale in their evaluative and belief (i.e., probability) forms. For example, the students were asked to rate "endurance" on the good-bad scale (e) and the statement "By running during physical education, I will increase my endurance" on the belief (b) scale.

Table 3-5. Sports Plus Attitude Scale: Physical Scale

Evaluative Attributes    Belief-Probability Statements(a)

Endurance                By running during physical education, I will increase my endurance.
Physical Fitness         Physical education does not improve my fitness. (-)
Speed                    When I play games during gym, I run faster.
Sports Skill             Skill drills do not make me a better player. (-)
Strength                 If I take gym, I will get stronger.

(a) Statements followed by (-) were negative statements which were reverse scored. From Norton (1984).
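Under the expectancy-value model, the two ratings are combined by multiplying each evaluation (e) by the belief probability (b) for the same attribute and summing the products across attributes. The sketch below uses the probability values reported for the 7-point agree-disagree scale; the function names, the handling of reverse-scored statements (the belief rating is flipped as 8 minus the rating before the probability is looked up), and the sample responses are our own assumptions for illustration and are not taken from Norton (1984).

```python
# Probability attached to each point of the 7-point agree-disagree belief scale.
BELIEF_PROBABILITY = {7: 1.00, 6: 0.83, 5: 0.67, 4: 0.50, 3: 0.33, 2: 0.16, 1: 0.0}


def item_attitude(evaluation, belief_rating, reverse=False):
    """Product of an evaluation (1-7) and the belief probability for one attribute."""
    if evaluation not in range(1, 8) or belief_rating not in range(1, 8):
        raise ValueError("ratings must be integers from 1 to 7")
    if reverse:
        # Assumed reverse-scoring rule for negatively worded belief statements.
        belief_rating = 8 - belief_rating
    return evaluation * BELIEF_PROBABILITY[belief_rating]


def attitude_score(evaluations, belief_ratings, reverse_flags):
    """Sum of the b x e products across all attributes on a scale."""
    return sum(
        item_attitude(e, b, r)
        for e, b, r in zip(evaluations, belief_ratings, reverse_flags)
    )


# Endurance rated "good" (7), belief statement rated "slightly agree" (5).
first_student = item_attitude(7, 5)
# Endurance rated "slightly bad" (3), belief statement rated "mostly disagree" (2).
second_student = item_attitude(3, 2)

# Hypothetical full response to the five Physical scale attributes in table 3-5;
# the second and fourth belief statements are negatively worded.
total = attitude_score(
    evaluations=[7, 6, 5, 6, 7],
    belief_ratings=[5, 2, 6, 3, 7],
    reverse_flags=[False, True, False, True, False],
)

print(round(first_student, 2), round(second_student, 2), round(total, 2))
```

The two single-item products correspond to the item-level attitude values of 4.69 and .48 that appear in the discussion of Norton's scoring chart.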


The student's attitude toward physical education (i.e., the target object) was estimated by multiplying the student's evaluation (e) of each attribute by the belief (b) probability associated with the corresponding statement. A handy chart developed by Norton for use with the expectancy-value technique crosses the 7-point good-bad evaluation (e) rating scale with the belief probability values associated with the 7-point agree-disagree belief scale. To use the matrix, consider two students. A student who rates endurance as "good" (7) and "slightly agrees" (5) with the belief "By running during physical education, I will increase my endurance" receives an attitude score of 4.69 (i.e., 7 x .67) for this statement. A student with an evaluation of "slightly bad" (3) and belief of "mostly disagree" (2) receives a score of only .48 (i.e., 3 x .16). The item-level attitude scores are then summed across the items.

The construct validity of the instrument was then examined. After factor analysis, the SPAS contained three factors: Physical, [illegible], and Social. Alpha reliabilities for these three scales were found to be .75, .77, and .77, respectively. Later development of the final version of the SPAS included adding new items to define a fourth factor.
In summary, this Section 


which were developed by T 
(Osgood et al., 1957), and 


\ ( pect to а target object. Each technique 
resulted in an attitude Score by summin 


8 the product of an evaluation of the 
favorable—unfavorablenesg of the attribute stated in the item and the per- 


on to 
her hand. the 1, as having no relati d 
differential techniques recorq the Шы latent trait, Likert, and semantic 
andinclude the item in the cali ent with 


a nepati ^ item 
: bration 1а negatively stated i 
A final difference described amon Process leadi 


8 the techni 08 to the attitude score. 

characteristic Curves or tracelines resulting ion ae Yas the different item 

tion employed. om the method of item selec- 
In selecting an attitude scali 


bipolar evaluative 

ariabl alfective characteristic. 

and interest. Ss described in chapters 1 

normative) of vi now turn to а consid- 

i ich j alfective į t 

ner In which items are Presented im. sue виа tha 
Strument. 


concept, Values, 
eration of the Properties (ipsative ог 
result from the man 
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Ipsative Versus Normative Scale Properties 


In the previous sections we discussed ways for scaling items and measuring 


people. Related to this scaling issue is a decision regarding whether the 
instrument will have ipsative or normative measurement properties. This is 
a fairly serious decision because the ipsative and normative properties are 
associated with important practical and psychometric differences. In the 
sections that follow we will examine these differences for ipsative and then 
normative scales. The definition of each type of scale will be followed by 
examples of well-known instruments. Finally, we will discuss the practical 
and psychometric implications associated with ipsative and normative scales.


Ipsative Measures 


Cattell (1944) employed the term ipsative to describe forced-choice measures that yield scores such that an individual's score is dependent on his/her own scores on other scales, but is independent of and not comparable with the scores of other individuals (Hicks, 1970). The term ipsative (Latin ipse = he, himself) was chosen to represent scale units relative to the individual's scores on the other scales on the instrument. This is the case since any individual's scores across the scales on the instrument will always sum to the same constant value. When this is the case the instrument can be termed "purely ipsative." Variations on the ipsativity of instruments can result in "partially ipsative" instruments, but this discussion will focus on "purely ipsative" measures.

The Edwards Personal Preference Schedule (Edwards, 1959) is a good example of a forced-choice format that results in an ipsative measure. The EPPS consists of 135 distinct items presented as 225 pairs of items or 450 separate items which yield 15 scale scores. Each scale is defined by nine different items. Eight of the nine items are used three times and one item is used four times to yield a total score for each individual of 210 and a maximum scale score of 28. Respondents select one item from each of the 225 item pairs and one point is given to the scale represented by the item.

The use of triads on the Kuder Occupational Interest Survey (Kuder and Diamond, 1979) and Gordon's Survey of Interpersonal Values (1960) illustrates another forced-choice format. In the triad case, each item represents a different scale on the instrument. One of the three items is selected by the respondent as the most applicable (e.g., an interest inventory) or important (e.g., a value-orientation instrument) and one of the remaining two is
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and fak- 
ability (i.e., giving socially acceptable, usually positive, responses) and 
ing (i.e., intentionally un 


because it is quite difficult f 


(Nunnally, 1 


І 
al knows about himself/herse 
difficult for г, 


"t uite 
uantifying these areas has proven to be q 
€searchers. In s 


ight 
attempts are a step in the m 
Тее components of social de 
Пу. 
SURE to develop instruments that are free pen 
responses. Consider for example Gordon's (1960) Survey of Interpersonal Values (SIV), which employs 30 triads to measure six interpersonal values: Support, Conformity, Recognition, Independence, Benevolence, and Leadership. Respondents are asked to select the statement that is "most important" and the statement that is "least important" to their way of life. Several businesses have used the SIV to screen applicants for managerial positions so that potential leaders can be identified. The forced-choice format on the SIV, however, cannot eliminate socially desirable responses. To support this claim consider the two SIV profiles presented in figure 3-4. Twenty senior-level teacher education students took the SIV, which resulted in the profile labeled "honest." As expected, the prospective teachers were found to be high on Support and Benevolence and were also high on Independence. During the same session the students were given new SIV forms and asked to respond as if they were applying for a job in the competitive business world and wanted to give the appearance of a "good manager candidate." The "fake" condition profile clearly indicates that the group is now highest on Leadership followed by high emphasis on Recognition and Independence. Thus, when people are threatened or simply desire to present a particular image, even the forced-choice format cannot eliminate faked responses.

Figure 3-4. Honest and faking profiles for SIV data from 20 education students.


Practical Considerations 


A serious problem with ipsative measures results from the fact that the sum of each person's scores across all scales is the same as any other person's. The implications of this from a practical (e.g., counselor's) viewpoint are different from those for the test developer. The counselor can use such ipsative scores to compare a person's score on one scale with that person's score on another scale but should not compare the scores with other people's scores in a normative sense since high scores on one scale for an individual must result in low scores on another scale if the scales add to the same total score. At the same time the counselor must realize that the scoring system
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counselors who find 
So that discussions follow easily. 
This point сап be ; 


figure 3-4. The Six 


amount 
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types: those People hi i 


ans;

Table 3-7 contains the correlations among the six scales reported in the SIV manual (Gordon, 1960). Note the several negative correlations which, Gordon notes, are "due to the interdependence among the scales resulting in the forced-choice format."

A related problem lies in the area of validating scores on the instrument using regression and multiple correlation procedures. The fact that the sum of the validity covariances between a specified criterion and the ipsative variables equals zero can result in sleepless nights for the ipsative test developer. In fact, Clemans says that the ability to predict a specified criterion is not increased by deleting one variable from the ipsative set (Clemans, 1966, pp. 30-33).⁶ But as convincing as Clemans' argument is mathematically, much actual test data is yet to be analyzed.

Table 3-7. Intercorrelations Among SIV Scales(a)

                   C        R              I              B              L
Support           -.09     [illegible]    [illegible]    [illegible]    [illegible]
Conformity                 -.38           -.38            .39           -.45
Recognition                               -.30           -.37           [illegible]
Independence                                             -.44            .06
Benevolence                                                             -.41

(a) From SIV manual (Gordon, 1960).
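The constraint behind these validation problems can be seen in a small simulation. The sketch below is an illustration under our own assumptions, not a reanalysis of any SIV data: normative scores on six hypothetical scales are generated at random, made ipsative by subtracting each person's own mean (so every person's scores sum to the same constant, here zero), and then related to an arbitrary criterion. Because the ipsative scores sum to a constant for every person, their covariances with any criterion must sum to zero, and the scale intercorrelations are pushed in the negative direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_scales = 200, 6

# Hypothetical normative scores: independent scales, no built-in relationships.
normative = rng.normal(size=(n_people, n_scales))

# Hypothetical criterion related to the first normative scale.
criterion = normative[:, 0] + rng.normal(size=n_people)

# Assumed ipsatizing rule: express each score relative to the person's own mean.
ipsative = normative - normative.mean(axis=1, keepdims=True)

row_sums = ipsative.sum(axis=1)
cov_with_criterion = np.array([
    np.cov(ipsative[:, j], criterion)[0, 1] for j in range(n_scales)
])

corr = np.corrcoef(ipsative, rowvar=False)
mean_offdiagonal_r = (corr.sum() - n_scales) / (n_scales * (n_scales - 1))

print("largest |row sum|:", np.abs(row_sums).max())
print("sum of criterion covariances:", cov_with_criterion.sum())
print("mean scale intercorrelation:", round(mean_offdiagonal_r, 2))
```

Even though the simulated normative scales are unrelated to one another, the ipsative versions show a negative average intercorrelation, which is the same kind of artifact Gordon attributes to the forced-choice format.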

Another psychometric restriction of ipsative instruments is the inappropriateness of factoring the scales on the instrument to examine construct validity (Clemans, 1966; Guilford, 1952). Since the aim of factor analysis is to generate dimensions or constructs which describe parsimoniously the scale covariations, it is not methodologically sound to factor a matrix whose entries partially reflect the psychometric virtues of ipsative scales rather than the true conceptual interrelationships among the scales. In fact, Clemans (1966) illustrates that under some conditions, the ipsative covariance matrix contains the same "variance information" as the residual matrix that results from taking out the first centroid from a normative intercorrelation matrix. If the information gained or lost can be measured by variance, the fact that the first centroid, like the first principal component, accounts for maximum test variation clearly indicates the problem of "variance information" missing in an ipsative matrix.

The area of reliability has also been noted as a problem for ipsative measures (Hicks, 1970). Studies by Scott (1968) and Tenopyr (1968) suggest that the interdependence of the scale scores resulting from the ipsative technique can seriously limit the alpha internal consistency reliabilities of the ipsative scales as compared to normative scales.

In summary, we have described some of the properties of ipsative scales resulting from the use of forced-choice paired comparison and triad formats. While the forced-choice format may assist in controlling for response bias in some situations, from a practical and psychometric viewpoint, several problems have been identified that suggest much caution in and justification for the use of such scales.
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Normative Measures 


Normative scales are more popular than ipsative scales since they are gener- 
ally easier to construct and tend to present fewer practical and psychometric 
problems. On a normative scale, individuals respond to each item separately 
(i.e., no pairs or triads) with the result that there is no fixed total score across 
the scales included in the instrument. That is, individuals could have all high 
scores or all low scores across the scales, since the only restriction is set by 
the possible score range associated with the response format. 

Examples of normative scales are Super's Work Values Inventory (1970), Coopersmith's Self-Esteem Inventory (1966), and the Gable-Roberts Attitude Toward Teacher Scale (1982).


Practical Considerations 


d 


From a practical viewpoint normative instruments can be easier to develop 


an individual has all 


dents will display variability 


uch that most respon- 
It is also easier for respondents 


dividual wishes them to be for a particular purpose. If the situation is nonthreatening and the goal is enhanced self-understanding, the presence of faking should be low.


Psychometric Considerations 


From a psychometric viewpoint, normative instruments are clearly preferred over ipsative ones. The simple reason is that validity evidence gathered through correlational, regression, or factor analytic techniques is not hindered by the restrictions set by ipsative properties. That is, obtained relationships among variables can be considered to reflect relationships among the concepts being measured; they are not partially due to the ipsative scoring system.

In summary, normative scales can be easier to construct and allow one to avoid the measurement problems associated with ipsative scales.


An Illustration 


McMorris (1971) developed normative and ipsative forms of a set of occupational values items (see Appendix B). Note that the footnote in table 3-8 indicates that each instrument contains 9 items measuring 3 occupational values areas: Favoritism (FAV), Intellectual Laziness (LAZ), and Greed (GRD). Also included are the item/scale assignments.

Appendix B contains the normative form for the 9 items, which are rated on a 5-point Likert scale where 1 = "unimportant" and 5 = "very important." Note that individuals actually rate all items using any points on the Likert scale; the scale score is formed by summing the responses to the appropriate items (e.g., FAV = item 1, 6, and 8).

The ipsative scale was developed using triads where one item from each scale is included in the triad. For example, in the first triad, items 10, 11 and 12 represent the FAV, LAZ, and GRD scales, respectively. Respondents were asked to select a "most" and "least" important statement for each triad. The "most," "blank," and "least" responses were scored 2, 1, and 0, respectively. Thus, consistent with the forced-choice technique, respondents carefully consider the three statements but are only allowed to indicate the one that is "most" important. A final question was also presented after the three triads in a quick attempt to quantify the extent that socially desirable responses could be present.

Forty undergraduate education students responded to both forms. After the scale scores were generated, means, standard deviations, and correlations were calculated for display in the multitrait-multimethod matrix presented in table 3-8. While Campbell and Fiske's (1959) original discussion of the multitrait-multimethod matrix was based upon using different methods of measurement (i.e., paper survey and peer ratings), displaying the correlations in this manner facilitates our discussion. Employing the strategy to be discussed further in chapter 4, we see some interesting patterns in the correlations. The underlined values in the diagonal represent validity coefficients (homotrait-multimethod or same trait measured by two


Table 3-8. Occupational Values Inventory: Means, Standard Deviations, and Correlations (N = 40)

Scales: NFAV, NLAZ, NGRD (normative form); IFAV, ILAZ, IGRD (ipsative form); SOCD.
Note: Normative form, 9 Likert items, each scored 1-5. Ipsative form, 3 triads, scored 0-2.
FAV = Favoritism; LAZ = Intellectual Laziness; GRD = Greed; SOCD = Social Desirability.


methods). At first glance these validity coefficients seem low, but each needs to be considered in light of the alpha reliabilities for the respective scales, which appear in parentheses in the main diagonal. The low number of items used for both the normative and ipsative scales appears to have resulted in low reliability levels, except for the normative GRD scale. Later, in chapter 5, we will note that the maximum validity coefficient is the square root of the product of the reliabilities of the two scales. For example, while the correlation between IFAV and NFAV was .41, the maximum correlation possible was $\sqrt{(.45)(.45)}$ or .45. On the other hand, the maximum correlation possible between the NGRD and IGRD scales is approximately .64 (i.e., $\sqrt{(.75)(.55)}$) and the correlation reported is only .36. Thus, for these scales the normative and ipsative measures using the same items are not highly related. We also noted that the diagonal validity values are higher than their row and column counterparts in the dashed-line triangles (multitrait-multimethod). But it is difficult to interpret the values in the dashed triangles, since they partially represent the ipsative scales which reflect both the occupational values content and the ipsative scale properties.
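The ceiling on a validity coefficient described above is simple to compute. The following sketch takes the reliabilities and correlations quoted in the text; the function name and the ratio of each observed correlation to its ceiling are our own additions for illustration.

```python
import math


def max_validity(reliability_x, reliability_y):
    """Upper bound on the correlation between two scales given their reliabilities."""
    return math.sqrt(reliability_x * reliability_y)


# Favoritism: both forms have alpha = .45; observed correlation = .41.
fav_ceiling = max_validity(0.45, 0.45)
fav_fraction = 0.41 / fav_ceiling

# Greed: normative alpha = .75, ipsative alpha = .55; observed correlation = .36.
grd_ceiling = max_validity(0.75, 0.55)
grd_fraction = 0.36 / grd_ceiling

print("FAV ceiling:", round(fav_ceiling, 2), "fraction reached:", round(fav_fraction, 2))
print("GRD ceiling:", round(grd_ceiling, 2), "fraction reached:", round(grd_fraction, 2))
```

Viewed this way, the Favoritism correlation is close to the maximum its reliabilities allow, while the Greed correlation falls well short of its ceiling.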

The solid triangles contain the scale intercorrelations (multitrait-homomethod) for the normative and ipsative forms. For the normative form it makes sense that those individuals with high scores on Greed tended to also score highly on Favoritism (r = .60), while the Intellectual Laziness scale was found to have low correlations with Favoritism (r = .17) and Greed (r = .18). While these correlations make conceptual sense, the correlations from the ipsative scales in the solid triangle are negative and make little sense. This is consistent with Clemans' (1966) statements reviewed earlier regarding ipsative measures.

Finally, the correlations between the scale scores and the indication of social desirability (SOCD) (item 19) are presented in the bottom row of table 3-8. Consistent with the opinions of proponents of ipsative measures, the ipsative version appears to result in scores that are less related to socially desirable responses in this nonthreatening situation.

In summary, this example using normative and ipsative forms for the same set of items has illustrated some of the attributes of ipsative scales. While the validity coefficients were low due to low reliability levels, the reliabilities of the ipsative and normative scales were generally similar and the extent of variation due to socially desirable responses appeared to be lower for the ipsative scales. The primary point illustrated was generation of negative scale intercorrelations resulting from the ipsative scoring system. We can thus conclude again that, given their extensive practical and psychometric limitations, ipsative scales should be developed with caution and careful justification.
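The source of the ipsative constraint in this illustration is the triad scoring rule itself. The sketch below scores three triads using the 2, 1, 0 rule described for the "most," "blank," and "least" responses; the function name and the two response patterns are hypothetical and serve only to show that every respondent receives the same total across the three scales.

```python
SCALES = ("FAV", "LAZ", "GRD")
POINTS = {"most": 2, "blank": 1, "least": 0}


def score_triads(responses):
    """Sum triad points per scale; each triad maps every scale to one choice."""
    totals = dict.fromkeys(SCALES, 0)
    for triad in responses:
        if set(triad) != set(SCALES):
            raise ValueError("each triad needs one item from every scale")
        if sorted(triad.values()) != ["blank", "least", "most"]:
            raise ValueError("each triad needs exactly one most, blank, and least")
        for scale, choice in triad.items():
            totals[scale] += POINTS[choice]
    return totals


# Two hypothetical respondents, three triads each.
respondent_a = score_triads([
    {"FAV": "most", "LAZ": "least", "GRD": "blank"},
    {"FAV": "most", "LAZ": "blank", "GRD": "least"},
    {"FAV": "blank", "LAZ": "least", "GRD": "most"},
])
respondent_b = score_triads([
    {"FAV": "least", "LAZ": "most", "GRD": "blank"},
    {"FAV": "least", "LAZ": "most", "GRD": "blank"},
    {"FAV": "blank", "LAZ": "most", "GRD": "least"},
])

print(respondent_a, sum(respondent_a.values()))
print(respondent_b, sum(respondent_b.values()))
```

Each triad distributes exactly three points, so with three triads every respondent totals nine points regardless of the pattern of choices. A high score on one scale can therefore only be obtained at the expense of the others.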
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Summary 


This chapter has described selected techniques appropriate for scaling items 
and measuring the affective characteristics of people. Also presented was a 
comparison of ipsative and normative scales. We turn now to the important 
topic of examining the validity of affective instruments. 


Notes 


‘Note that you should have some jud; 


stems. It could be that a stem d be perceived as negative by the 


liability of any scale containing this 
?Note that previously we have used the term * 


particular instrument. For the seman 
Osgood et al. (1957) and use the term ** 


'scale" to represent a cluster of items on a 
tic-differential tec! 


apter 4 regarding the relationship of content 
and construct validity. B 8 P 


⁵Recall that to generate a correlation matrix the covariances between the variables are normalized (divided by the square root of the product of their separate variances) to get the correlation. That is,

$$r_{xy} = \frac{s_{xy}}{\sqrt{s_{x}^{2}\, s_{y}^{2}}}$$

⁶A common practice has been to delete one of the scales when using the scale scores as predictors in multiple regression, in order to reduce the multicollinearity problem.
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4 THE VALIDITY ОҒ АҒҒЕСТІУЕ 
INSTRUMENTS 


The investigation of the validity of an affective instrument addresses the general question: "Does the instrument measure what it is supposed to measure?" Contrary to the thinking of some researchers, a test is not certified once and for all as "valid." Rather, the investigation of validity is an ongoing process. The process continually addresses the appropriateness of the inferences to be made from scores obtained from the instrument (see Cronbach, 1971). That is, validity focuses on the interpretations one wishes to make for a test score in a particular situation. As stated in the Standards for Educational and Psychological Tests,

Validity is the most important consideration in test evaluation. The concept refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. Test validation is the process of accumulating evidence to support such inferences. A variety of inferences may be made from scores produced by a given test, and there are many ways of accumulating evidence to support any particular inference. Validity, however, is a unitary concept. Although evidence may be accumulated in many ways, validity always refers to the degree to which that evidence supports the inferences that are made from the scores. The inferences regarding specific uses of a test are validated, not the test itself. (American Psychological Association, 1985, p. 9)
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Validity Evidence 


Arguments for validity are based upon two types of evidence: judgmental 
and empirical. The judgmental evidence is generally gathered prior to the 
actual administration of the items to the target group and consists mainly of 
methods for examining the adequacy of the operational definition of the 
affective characteristics in light of its conceptual definition. The empirical 
evidence is argued after the instrument has been administered to the target 
respondents so that relationships among items within the instrument, as well 
as relationships to instruments measuring similar or different constructs, can 
be examined with respect to the theory underlying the variables measured. 
In this chapter we will describe the three commonly identified types of 
validity: content, construct, and criterion-related. Techniques for gathering 
appropriate judgmental and empirical evidence for each type of validity will 
also be described and illustrated. Since this text addresses affective charac- 
teristics, emphasis will be placed upon content and construct validity. 


Content Validity 
Definition 


Content validation should receive the highest priority during the process of 
instrument development. Unfortunately, some developers rush through the 
process with little appreciation for its enormous importance only to find that 
their instrument “does not work” (lack of construct validity or internal 
consistency reliability) when the response data are obtained. The impor- 
tance of content validity can be seen when its definition is considered in light 
of the conceptual and operational definitions of the affective characteristics 
presented in chapters 1 and 2. 

According to Cronbach (1971), content validity is assessed by answers to 
the question To what extent do the items on the test (instrument) adequately 
sample from the intended universe of content? We know from chapter 1 that
underlying any affective characteristic is a theoretical rationale and concep- 
tual definition that describes the universe of possible items or content areas 
to be included in the instrument. Given the conceptual definitions, the 
developer adopts development procedures that generate the operational 
definitions described in chapter 2. The responses obtained from administra- 
tion of the instrument then reflect these operational definitions and are 
used to make inferences back to the conceptual definitions underlying the 
affective characteristics. Thus, unless the instrument developer carefully 
addresses the process of content validation, interpretation of the resulting
data will most likely be meaningless (i.e., lack validity). 


Evidence 


The evidence of content validity is primarily judgmental in nature and is 
mostly gathered prior to the actual administration of the instrument. Two 
primary areas become the focus of the evidence of content validity: the 
conceptual and the operational definitions of the affective characteristics. 


Conceptual Definitions. The theoretical basis for the conceptual definitions
is developed through a comprehensive review of appropriate literature.
Instrument developers must specify and summarize their literature
base in the technical manual. This is essential, since the evidence of content
validity revolves around judgments regarding the universe of content from
which the instrument developers have sampled in developing the instrument.

It is recommended that a panel be established consisting of about five
content experts with professional expertise in the area of the affective
characteristic under consideration. Appropriate experts could be university
professors and/or graduate students from the education or psychology areas.
It is essential that the experts be thoroughly grounded in the literature
representing the affective characteristic.

The panel of experts should be provided a bibliography and summary of
the literature used as a definition of the universe of content. Individually or
as a group, the experts can then review the materials and comment on the
adequacy of the conceptual definition of the affective characteristic as it
relates to the proposed use of the instrument. Simple rating sheets could be
developed for this task so that the experts could rate and comment on such
areas as comprehensiveness of theory and adequacy of sampling from the
content universe. Following this review of the theoretical base, the
operational definitions can be addressed.



Operational Definitions. The operational definitions discussed in chapter 2
are the vehicles by which the developer samples from the universe of content
specified by the conceptual definition. It is essential that the operational
definitions be reviewed by these same five content experts and their assessment
reported in the Technical Manual as evidence that the sampling of
items adequately reflects the intended universe of content. Inadequate sampling
will necessarily lead to invalid inferences regarding test score interpretations.
Put simply, correspondence between the operational and conceptual
definitions is essential for content validity.


Table 4-1. Sample Content Validity Rating Form

Instructions. The statements that follow are being considered for inclusion in a
(identify name of the survey) survey. Please assist us in reviewing the content of the
statements by providing two ratings for each statement. The conceptual definitions
of the categories these statements are supposed to reflect as well as the rating
instructions are listed below.

Categories                  Conceptual Definition
I.   Name of category       Definition
II.  Name of category       Definition
III. Name of category       Definition
IV.  Name of category       Definition

RATING TASKS

A. Please indicate the category that each statement best fits by circling the
   appropriate numeral. (Statements not fitting any category should be placed in
   Category V.)

B. Please indicate how strongly you feel about your placement of the statement into
   the category by circling the appropriate number as follows:
      3  no question about it
      2  strongly
      1  not very sure

Statements                       Categories             Rating
1. (list statements here)        I   II   III   IV   V      1   2   3
2.                               I   II   III   IV   V      1   2   3
3.                               I   II   III   IV   V      1   2   3
(Continue for all items)


It is proposed that the judgmental data be obtained in two ways. First, the
panel of five experts can be given the operational definitions such as those
contained in tables 2-1 and 2-2 in chapter 2. Through group or individual
discussion and possibly ratings, the correspondence between the conceptual
and operational definitions should be ascertained.

A second judgmental rating exercise can be carried out to examine the 
extent to which the items truly reflect the content categories specified in 
developing the operational definitions. This judgmental evidence is best 
collected from a larger group of content experts. It is suggested that from 
15-20 experts be used who are somewhat knowledgeable in the area of the 
affective characteristic. These experts will not be judging the comprehensiveness
of the literature review, so they will not have to have the same high
level of credentials as the prior set of experts. Typically, a dedicated group of
graduate students or teachers would be ideal judges.


Table 4-2. Hypothetical Content Validity Ratings for Two Statements

Statement                                            I     II    III    IV     V
15  Presents the subject so that it       F          1     18     1
    can be understood.                    %          5     90     5
                                          Meanᵃ           2.90
 1  Motivates students to learn.          F                19     1
                                          %                95     5
                                          Mean            2.78
(continue for all items)

ᵃ Note that the mean rating is calculated for the category with the highest percentage only.


Table 4-3. Ranked Mean Ratings for Category II

                                                              Ranked
Statement                                                     Mean Rating
15  Presents the subject so that it can be understood.        2.90
 1  Motivates students to learn.                              2.78
(continue for other items)


Table 4-1 contains a sample rating form to be used in this exercise. The
form begins with instructions regarding the rating task and then lists the
definitions of the categories such as those illustrated from table 2-2. The
judges are then asked to (1) assign each item to the category it best fits and
(2) indicate how strongly (comfortable) they feel about their assignment
of the item to the category.


Table 4-2 contains a hypothetical display of data obtained for items 15
and 1 from the Gable-Roberts Attitude Toward Teacher Scale (see table
4-6). For each item the frequency and percentage of assignment to each
category are listed. A criterion level of 90% agreement is recommended for
an item to remain in a particular category without revision. After the
high-percentage category is identified, the mean “comfort” rating is calculated
for only the items assigned to that category.

The final step consists of listing the item stems in the order of their ranked
“comfort” ratings, as illustrated in table 4-3. This information summarizes
the best items within each category as judged by the content experts. On the
basis of the information gathered through the rating procedure, items can be
revised or added.
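
The tallying steps just described (frequency and percentage per category, the
90% criterion, and the mean “comfort” rating for the highest-percentage
category) can be sketched in Python. This is a minimal illustration only: the
function names and the individual judges' ratings are hypothetical, made up so
that the percentages resemble those in table 4-2.

```python
from collections import Counter


def summarize_item(ratings, criterion=90.0):
    """Summarize one item; ratings is a list of (category, comfort) pairs,
    one pair per judge."""
    counts = Counter(category for category, _ in ratings)
    n_judges = len(ratings)
    percents = {cat: 100.0 * k / n_judges for cat, k in counts.items()}
    top_category, _ = counts.most_common(1)[0]
    # Mean comfort uses only the judges who chose the top category.
    comfort = [c for cat, c in ratings if cat == top_category]
    return {
        "percents": percents,
        "top_category": top_category,
        "mean_comfort": sum(comfort) / len(comfort),
        "retain": percents[top_category] >= criterion,
    }


def rank_items(summaries):
    """Return item labels ordered by mean comfort rating, highest first."""
    return sorted(summaries,
                  key=lambda label: summaries[label]["mean_comfort"],
                  reverse=True)


# Hypothetical ratings from 20 judges (not data from the text).
item_15 = [("II", 3)] * 16 + [("II", 2)] * 2 + [("I", 1), ("III", 2)]
item_1 = [("II", 3)] * 15 + [("II", 2)] * 4 + [("III", 1)]

summaries = {"item 15": summarize_item(item_15),
             "item 1": summarize_item(item_1)}
ranked = rank_items(summaries)
```

Under these made-up ratings both items meet the 90% criterion for Category II,
and `ranked` lists item 15 ahead of item 1, which mirrors the ordering shown in
table 4-3.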

While we have described content validation as a judgmental process, it
should be noted that some fairly complex empirical procedures are also
available. In general, such procedures consist of an analysis of each
expert’s sorting of the items into any number of mutually exclusive
categories. These judgmental data are then analyzed to ascertain whether there are
underlying meaningful content categories that reflect the judges’ sorting of the
items. The analysis technique, called latent partition analysis (Wiley, 1967),
creates a joint-proportion matrix where each entry indexes the proportion of
sorters who placed a given pair of items in the same manifest category. This
matrix, which is similar to a correlation matrix, is then analyzed to see if
there exist latent categories that explain variation in the proportions. (The
latent categories can be roughly thought of as “factors” derived from factoring
a correlation matrix.) The latent categories are named on the basis of
their item content and the results are used to further refine the content of the
items. A useful feature of such an analysis is that the resulting instrument can be
administered to a target group so that a factor analysis of actual response
data can be carried out. The beauty of this approach is that the “judgmentally”
derived content categories of the content experts can then be compared
to the “empirically” derived constructs of the respondents. Since the goal of
construct validity, to be presented in a later section, is to provide finer
interpretations of the scores on the instrument, it is often the case that the
interpretation of the constructs derived through the factor analysis is greatly
facilitated by studying how the content experts sorted the items. Readers are
referred to two studies which illustrate how the content and construct validity
information contributes to the overall validation process: Gable and
Pruzek (1972) and Coletta and Gable (1975).
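
The joint-proportion matrix itself is simple to compute. The sketch below
builds only that matrix; it does not carry out the latent partition analysis
that follows it. The function name and the four sorters' data are hypothetical.

```python
def joint_proportion_matrix(sortings):
    """Build the joint-proportion matrix from the sorters' data.

    sortings holds one list per sorter; position i gives the label of the
    category into which that sorter placed item i. Labels need not match
    across sorters, because only "same pile or not" matters.
    """
    n_sorters = len(sortings)
    n_items = len(sortings[0])
    matrix = [[0.0] * n_items for _ in range(n_items)]
    for i in range(n_items):
        for j in range(n_items):
            same = sum(1 for sorter in sortings if sorter[i] == sorter[j])
            matrix[i][j] = same / n_sorters
    return matrix


# Hypothetical sortings of four items by four sorters.
sortings = [
    ["a", "a", "b", "b"],
    ["x", "x", "x", "y"],
    ["p", "q", "r", "r"],
    ["m", "m", "n", "n"],
]
P = joint_proportion_matrix(sortings)
```

Here items 1 and 2 were placed together by three of the four sorters, so the
corresponding entry is .75, while items 1 and 4 were never placed together and
the entry is 0.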

This section has emphasized the importance of establishing the correspondence
between the conceptual and operational definitions associated
with the affective instrument. Techniques were described for gathering
information to defend the argument that the items developed adequately
sample from the intended universe of content being measured by the instrument.
It was noted that the content validity argument is mostly based upon
judgmental data. Given that arguments for content validity were convincing
and appropriate revisions in the instrument have been made, the basis of the
argument now switches from judgmental to empirical data during construct
validation.
Construct Validity
Definition 


Construct validation addresses the question To what extent do certain
explanatory concepts (constructs) explain covariation in the responses to the
items on the instrument? Whereas the content validity argument focused on
experts’ judgments regarding the adequacy with which the test items
reflected specified categories in the content universe, the construct validity
argument focuses directly on response data variation among items to
ascertain evidence that the proposed content categories actually reflect
constructs. These constructs (or concepts) have been previously specified
through the conceptual and operational definitions of the affective
characteristic (see chapters 1 and 2). The argument that the instrument actually
measures the construct is only successful when relationships among the
items (operational definitions) comprising the instrument, as well as
relationships with specified variables from other known instruments, exist in a
manner judged to be consistent with the conceptual and operational definitions.
Thus, construct validation is an ongoing process of testing hypotheses
regarding response data relationships for the items (or scales) of the
developing instrument and other, known, instruments. The sections that follow
will describe the empirical evidence relevant for arguing the case for
construct validity.


Evidence 


Evidence of construct validity is gathered by administering the instrument to
a representative sample of respondents for which the instrument was designed.
Empirical analyses of these data are then carried out in the midst of
theoretically based logical arguments regarding the existence of meaningful
constructs. Four popular analysis techniques will be described and illustrated:
correlations with other variables, the multitrait-multimethod analysis,
factor analysis, and the known-groups procedure. While all of the techniques
are essentially correlational in nature, they will be described separately for
ease of understanding.


Correlation 


The most commonly employed statistical technique for examining construct 
validity is correlation (r). Unfortunately, many researchers fail to uncover
the richness of this strategy because their analysis strategy is not grounded 
in the theory underlying the variables. Some researchers continually list 
correlations between their target instrument and another instrument with 
no statement regarding the meaning of the correlation. For example, a re- 
searcher may state that instrument A correlates .20 with instrument B 
with no explanation of whether this magnitude of relationship is supportive 
of construct validity on the basis of the theory underlying the two vari- 
ables. Many graduate student Ph.D. proposals and theses merely contain 
lists of correlations with other measures with no arguments supportive 
of construct validity. 

Simply put, all arguments for construct validity must be based upon 
theories underlying the variables in question. This amplifies the importance 
of clear conceptual definitions generated during the process of instrument 
development (see chapter 1). It is from these conceptual definitions and 
their theoretical base that hypotheses are generated regarding the traits 
measured by the instrument and those from other known instruments. 
Clearly, the hypotheses need to be stated in advance of gathering the data so 
that the resulting data can be seen to support or fail to support the proposed 
relationship (see Carmines and Zeller, 1979; Cronbach and Meehl, 1955). 
An example from the area of work values will illustrate this point. 

Chapter 1 included a brief description of Super’s Work Values Inventory 
(1970). While that instrument was in the prepublication stage, Gable (1970) 
revised and added selected items to the 15 WVI scales and studied the 
construct validity of the revised WVI using a sample of 503 grade 11 students
from three school districts. The literature on work values was first reviewed 
to develop a good understanding of the work values concept and also to 
identify a list of what Cronbach (1971) has called the other “known indica- 
tors.” These known indicators represent other well-known instruments 
assessing constructs theoretically related to work values. The Edwards Per- 
sonal Preference Schedule (Edwards, 1959), Kuder Preference Record (Kud- 
er, 1951), Survey of Interpersonal Values (Gordon, 1960), and the Study of 
Values (Allport, Vernon, and Lindzey, 1960) were identified. Normative 
versions of selected scales from the EPPS and SOV and the complete KUD 
and SIV measures were employed. In addition, Super’s stated relationships
of work values with aptitude, achievement, social class, and sex were
examined. Aptitude was measured by Differential Aptitude Test (Psychological
Corporation, 1982) scores, achievement by the prior year’s grades in the
content areas. Finally, social class was categorized using the Warner, Meeker,
and Eells (1949) 7-point scale based on head-of-household occupation
(1 = highest, 7 = lowest); and sex was coded male = 1 and female = 0.


Table 4-4. An Illustration of Examining Construct Validity Using Correlations

                                                       Direction of
                                                       Hypothetical
                                                       Relationship
Instrument                 Scale                       with WVI-Altruism   Correlationᵃ

Edwards Personal           Achievement
Preference Schedule        Affiliation                 +                   47
                           Autonomy
                           Change
                           Dominance
                           Nurturance                  +                   56

Kuder Preference           Outdoor
Record                     Mechanical
                           Computational
                           Scientific
                           Persuasive
                           Artistic
                           Literary
                           Social Service              +                   53
                           Clerical

Survey of Interpersonal    Conformity                  0
Values                     Recognition
                           Independence
                           Benevolence                 +                   62
                           Leadership

Study of Values            Theoretical
                           Economic
                           Aesthetic
                           Social                      +                   49
                           Political

Differential Aptitude      Verbal Reasoning            0
Test                       Numerical Reasoning         0
                           Abstract Reasoning          0
                           Mechanical Reasoning        0
                           Space Relations             0

Achievement                English                     0
                           Math                        0
                           Science                     0
                           Social Studies              0

Social Class                                           0

Sexᵇ                                                   −                   −37

ᵃ Only correlations greater than .10 have been included; decimals have been omitted.
ᵇ Sex was coded male = 1 and female = 0.




Table 4-4 illustrates the use of correlations to examine construct validity 
for one of the 15 WVI scales labeled Altruism. Super (1970) had defined 
Altruism as an orientation toward helping people. Teachers and Peace 
Corps volunteers would most likely have high Altruism scores. Prior to 
gathering the actual data, hypotheses were generated on the basis of the 
theory underlying Altruism and selected scales from the other known mea- 
sures and variables (e.g., achievement). Table 4-4 contains a listing of the 
measures/constructs used for analysis of the Altruism scale. Each construct 
label is followed by a +, 0, or — sign to indicate the direction of the hypoth- 
esized relationship suggested by the appropriate theories underlying the 
variables. Also included are the obtained correlations for the 503 grade 11 
students. Given the large sample size, emphasis was not placed upon statis- 
tical significance. Rather, the focus became the direction and magnitude of 
the relationships in light of theoretical expectations. 
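
The logic of stating hypotheses first and then comparing them with the obtained
correlations can be sketched in Python. This is only an illustration, not the
procedure used in the original study: the function names are hypothetical, the
correlations are the ones reported in this section for the Altruism scale, and
the .10 cutoff for a negligible correlation is an assumption borrowed from the
reporting threshold in table 4-4.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def supports_hypothesis(direction, r, negligible=0.10):
    """Check an obtained r against a hypothesized direction (+, 0, or -)."""
    if direction == "+":
        return r > negligible
    if direction == "-":
        return r < -negligible
    return abs(r) <= negligible  # direction "0": no relationship expected


# (scale, hypothesized direction, obtained r) as reported in the text.
results = [
    ("EPPS Affiliation", "+", 0.47),
    ("EPPS Nurturance", "+", 0.56),
    ("Kuder Social Service", "+", 0.53),
    ("SIV Benevolence", "+", 0.62),
    ("SOV Social", "+", 0.49),
    ("Sex (male = 1, female = 0)", "-", -0.37),
]
checks = {scale: supports_hypothesis(d, r) for scale, d, r in results}
```

The point of the sketch is the order of operations: the direction column is
fixed before any correlation is computed, so each obtained value either
supports or fails to support a prediction made in advance.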

It is quite clear that the Altruism scale correlated as predicted with sever- 
al other known measures and variables. For example, people with a high 
score on the WVI Altruism scale tended to exhibit personality profiles with 
high Affiliation (r = .47) and Nurturance (r= .56) on the EPPS, and high 
interest in Social Service activities (r= .53) as measured by the Kuder. 
Further, they emphasized the interpersonal value of Benevolence (r = .62) 
on the SIV, exhibited a general value orientation with high emphasis in the 
Social (r = .49) area on the SOV, and tended to be female (r = —.37). Also, 
as hypothesized by Super, levels of Altruism tended not to be related to 
aptitude and achievement measures or with social class. Thus, evidence 


uct validity of the WVZ Al- 
ressed in a construct validity 
garding the interpretation of 
Multitrait-Multimethod Matrix


The multitrait-multimethod (MTMM) matrix technique has become an 
increasingly popular way for examining construct validity. Originally intro- 
duced in 1959 by Campbell and Fiske, the technique is essentially a system- 
atic way of analyzing correlation coefficients. Today’s use of computers to 
analyze large data sets gathered from multiple instruments has contributed 
to increased use of the technique as is evidenced by the increased number of 
journal articles reporting results using this analysis strategy. The technique 
appears difficult to understand at first because it can seem like a semantic
game. Actually, the technique is quite clear and indeed powerful for
examining construct validity under certain conditions. We will first describe
the rationale for the technique along with its associated vocabulary. Following
this, the analysis strategy will be described and some exemplary studies
found in the literature will be discussed.


Rationale. Campbell and Fiske (1959) employ the terms convergent and
discriminant validity. To understand these terms, consider a new instrument
that you are trying to validate. On the basis of the theory underlying the
instrument you can predict whether scores on the new instrument should
relate to scores on the known instrument (convergent validity). For example,
in the previous section we predicted that the new Work Values Inventory
(WVI) Altruism scale should correlate positively with the Survey of
Interpersonal Values (SIV) Benevolence scale (convergent validity), but should
not correlate with the SIV Conformity scale (discriminant validity). That is,
depending on the situation, construct validity can be supported by either
high or low correlations.

Campbell and Fiske (1959) also describe the score on an instrument as
a trait-method unit. That is, the trait being measured (e.g., Altruism) is
assessed by a particular method (e.g., self-report rating scale). The
resulting score received on the Altruism scale then reflects variation due to
the trait Altruism and variation due to the self-report measurement. When
we generate correlations between different scales (traits) using different
measurement methods (e.g., self-report and teacher rating), we observe
systematic variation (correlation) in the scores which is due to the trait being
measured and the measurement technique.

The goal of the MTMM technique is to estimate the relative contributions
of the trait and method variance to the respective correlation coefficient. To
achieve these goals, we need more than one trait measured by more than one
method. For example, Campbell and Fiske (1959) illustrate the technique
using the traits of Courtesy, Honesty, Poise, and School Drive assessed by
two different methods: peer ratings and self-report. (Readers should note
that the Altruism, Benevolence, and Conformity example presented earlier
illustrated the concepts of convergent and discriminant validity, but did not
really fit the MTMM framework because only one method, self-report, was
employed.)


Table 4-5. Sample Multitrait-Multimethod Matrix for Three Traits and
Two Methods

                                                 Method 1            Method 2
                 Traits                        A1    B1    C1      A2    B2    C2

Method 1         A1 (Peer Relations)           (R)
(Self Report)    B1 (Physical Abilities)       HM    (R)
                 C1 (School)                   HM    HM    (R)

Method 2         A2 (Peer Relations)           VA    HH    HH      (R)
(Peer Rating)    B2 (Physical Abilities)       HH    VB    HH      HM    (R)
                 C2 (School)                   HH    HH    VC      HM    HM    (R)

(R) = reliability diagonal; VA, VB, VC = validity diagonal; HM = heterotrait-monomethod
triangles (solid lines); HH = heterotrait-heteromethod triangles (dashed lines).


Analysis Strategy. Table 4-5 contains a MTMM shell which will be used
to illustrate the strategy for estimating the relative contributions of trait
and method variance along with convergent and discriminant validity.
are, in fact, similar to the correlations we discussed in the earlier section
entitled Correlation). Finally, we have correlations between different traits 
using different methods. These values are found in the two dashed-line 
triangles and are referred to as heterotrait-heteromethod (HH) values. Now 
that readers are thoroughly confused with semantic labels, we will proceed 
cautiously to analyze how the MTMM display assists in separating trait and 
method variance. 

First, the entries in the reliability diagonals should indicate that a
sufficient amount of reliable variance (say at least .75) is present in each
measure. Second, the entries in the validity diagonal indicate convergent validity
since they represent the correlation of the same trait using two different
methods. These entries should be significantly high and consistent with
theoretical expectations so that it is worthwhile to probe further to
determine how much of the relationship was due to the trait versus the methods
employed. Third, each respective validity diagonal value (i.e., VA) should
be higher than the correlations in the adjacent HH triangle (dashed-line
triangle). That is, the A1A2 correlation should exceed the A1B2 and A1C2
correlations since the B2 and C2 variables have neither trait nor method in
common with A1. When the A1A2 (or VA) correlation is higher than the A1B2
and A1C2 values, we can say that the magnitude of the A1A2 validity
coefficient is largely due to shared trait variance and not to variation shared
between methods 1 and 2. As noted by Campbell and Fiske (1959), this rule
may be common sense but is often violated in the literature. Fourth, the
correlation of a variable should be higher when a different method is used to
measure the same trait than when the same method is used to measure
different traits. Looking at the validity diagonal entry VA and the HM solid
triangle entries A1B1, A1C1, and B1C1, this means that VA, which reflects
the same trait (A) and different methods (1 and 2), should have a higher
value than correlations generated from any combination of variables which
reflect different traits but the same method (i.e., A1B1, A1C1, and B1C1). If
this were not the case, we would know that much of the magnitude of VA
(A1A2) resulted from common variation due to method and not trait. A fifth
guideline suggested by Campbell and Fiske (1959) is that the same pattern of
relationship be exhibited in each of the HM and HH triangles.

While this analytical strategy may appear confusing at first, readers are
encouraged to study the vocabulary carefully so they understand how the
estimates of trait and method variance contributions are determined. If one
is seriously developing a new instrument, the study of convergent and
discriminant validity to estimate trait and method variance is quite important
as one seeks to determine the fine interpretations of the constructs
underlying a score on the instrument.
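
The first four guidelines lend themselves to a mechanical check. The sketch
below is one possible reading of them, using the three-trait, two-method layout
of table 4-5. All correlations and reliabilities are hypothetical, the function
and variable names are illustrative, and the fourth guideline is applied in its
strictest form (each validity value is compared with every
heterotrait-monomethod value), which is an assumption.

```python
from itertools import combinations

TRAITS = ["A", "B", "C"]
METHODS = [1, 2]


def key(v1, v2):
    """Order-free lookup key for the correlation between two variables."""
    return frozenset((v1, v2))


# Hypothetical values (not data from the text).
reliability = {"A1": .85, "B1": .80, "C1": .78, "A2": .82, "B2": .79, "C2": .76}
r = {
    # heterotrait-monomethod (HM) triangles
    key("A1", "B1"): .30, key("A1", "C1"): .25, key("B1", "C1"): .20,
    key("A2", "B2"): .35, key("A2", "C2"): .28, key("B2", "C2"): .22,
    # validity diagonal
    key("A1", "A2"): .60, key("B1", "B2"): .55, key("C1", "C2"): .50,
    # heterotrait-heteromethod (HH) triangles
    key("A1", "B2"): .15, key("A1", "C2"): .10, key("B1", "A2"): .18,
    key("B1", "C2"): .12, key("C1", "A2"): .11, key("C1", "B2"): .09,
}


def mtmm_checks(r, reliability, min_reliability=0.75):
    """Apply the reliability, HH, and HM comparisons to each trait."""
    hm = [r[key(f"{t}{m}", f"{u}{m}")]
          for m in METHODS for t, u in combinations(TRAITS, 2)]
    report = {"reliable": all(v >= min_reliability
                              for v in reliability.values())}
    for t in TRAITS:
        validity = r[key(f"{t}1", f"{t}2")]
        # HH values sharing a row or column with this validity entry.
        hh = [r[key(f"{t}{m}", f"{u}{3 - m}")]
              for m in METHODS for u in TRAITS if u != t]
        report[t] = {
            "validity": validity,
            "exceeds_HH": validity > max(hh),
            "exceeds_HM": validity > max(hm),
        }
    return report


report = mtmm_checks(r, reliability)
```

With these made-up values every check passes. Lowering the A1A2 entry to .20,
for example, would leave it above the adjacent HH values but below the HM
values, which is the pattern that signals method variance.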
Illustrations. To thoroughly understand the MTMM technique, readers
are encouraged first to carefully read Campbell and Fiske’s (1959) article.
Following this, some of the studies listed at the end of the chapter can be
reviewed.


Factor Analysis 


Earlier in this chapter we noted that construct validity addresses the ques- 
tion “To what extent do certain explanatory concepts (constructs) explain 
covariation in the responses to the test items?” So far we have discussed two 
empirical techniques for examining construct validity, each based upon 
generating correlations between selected variables. For the first technique 
we simply correlated the two variables and examined their hypothesized 
relationship with respect to the theory underlying each variable. In the 
second technique, we displayed the correlations in a multitrait-multimethod 
matrix so that we could estimate the amount of trait and method variance 
present in the correlation coefficients. We will now continue the empirical 
analysis and turn to the relationships among the items on the instrument to 
ascertain if there exist constructs that help us explain the covariation among 
the items. If meaningful covariation among items exists, the clustering of 
items to form scales on the instrument will be supported. Consider the set of
Gable-Roberts Attitude Toward Teacher Scale (GRATTS) items displayed in
table 4-6. The instrument contained 22 items. If a research-
Table 4-6. Gable-Roberts Attitude Toward Teacher Scale Itemsᵃ

The teacher in this course:
 1. motivates students to learn.
 2. likes his (her) job.
 3. is fair in dealing with students.
 4. is willing to help students individually.
 5. is successful in getting his (her) point across.
 6. is interested in students.
 7. lacks enthusiasm.
 8. tries to cover too much material in too short a time.
 9. assigns too much homework.
10. does not make learning fun.
11. is generally cheerful and pleasant.
12. disciplines too strictly.
13. is not interesting to listen to in class.
14. has a sense of humor.
15. presents the subject so that it can be understood.
16. is too structured.
17. is too busy to spend extra time with students.
18. fails to stimulate interest in the subject matter.
19. is not interested in his (her) work.
20. does not evaluate student work fairly.
21. likes students.
22. tests too frequently.

ᵃ Underlined items are negative item stems and should be reverse scored.
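
Reverse scoring can be sketched as follows. Two assumptions are made here and
are not taken from the text: a 5-point response scale (1 to 5), and a set of
negative items inferred from the wording of the stems above. The function name
is hypothetical.

```python
# Assumed set of negative stems, inferred from item wording (an assumption).
NEGATIVE_ITEMS = {7, 8, 9, 10, 12, 13, 16, 17, 18, 19, 20, 22}


def reverse_score(responses, negative_items=NEGATIVE_ITEMS, low=1, high=5):
    """Flip responses to negative stems so a high score is always favorable.

    responses maps item number to the raw response on the low..high scale.
    """
    return {
        item: (low + high - value) if item in negative_items else value
        for item, value in responses.items()
    }


# Hypothetical raw responses from one student to four of the items.
answers = {1: 5, 7: 1, 15: 4, 22: 2}
scored = reverse_score(answers)
```

After scoring, a response of 1 to item 7 ("lacks enthusiasm") becomes 5, so it
points in the same favorable direction as a 5 on item 1.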


Prior to running the analysis, certain decisions have to be made and then the
output has to be clearly understood and properly interpreted within the
context of examining the existence of hypothesized constructs. (Readers are
referred to Tabachnick and Fidell’s excellent book on multivariate statistics
for more advanced reading.)

Previous instructional experience, as well as reviews of [...] and journal
articles, indicate that the technique of factor analysis is employed but not
always well understood. For this reason considerable
emphasis will be placed on factor analysis as a technique for examining
construct validity.

Purpose and Strategy. The purpose of factor analysis is to examine
empirically the interrelationships among the items and to identify clusters of
items that share sufficient variation to justify their existence as a factor or
construct to be measured by the instrument. That is, the factor analysis
technique examines the item-level intercorrelation matrix. Using some complex
mathematical procedures beyond the scope of this book, the procedure 
“decomposes” the correlation matrix into a set of roots (eigenvalues) and
vectors (eigenvectors). These roots and vectors are then appropriately
scaled (multiplied together) to generate a matrix usually called a factor 
loading matrix. Whereas the correlation matrix has the same number of rows 
and columns as items contained in the instrument, the loading matrix con- 
tains the same number of rows as there are items and number of columns as 
there are factors derived in the solution. The entries in the matrix represent 
the relationship (usually correlations) between each item and the derived 
factor.¹ The instrument developer lists the items defining each factor and
ascertains if the empirically identified cluster of items share common con- 
ceptual meaning with respect to the content of the items. If the items clearly 
share some conceptual meaning, this concept is described and referred to as 
a construct measured by the instrument. If successful, it would be unneces- 
sary to inform the researcher that the GRATTS measures 22 different things.
Rather, the three or four constructs measured by the instrument could be 
discussed. Thus, factor analysis is a data reduction technique in that the 22 
items could be described as measuring three or four constructs. The aim of 
factor analysis, then, is to seek parsimony in the description of the instru- 
ment (i.e., to employ a minimum amount of information to describe the 
maximum amount of variation shared among the variables). 
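
The decomposition can be illustrated on a small scale. The sketch below uses
the principal-component convention, in which each loading is an eigenvector
element multiplied by the square root of its eigenvalue. It is not the routine
used by any particular statistical package: the four-item correlation matrix
is hypothetical, the function names are illustrative, and a simple power
iteration stands in for the numerical methods real programs use. No rotation
is applied.

```python
import math


def top_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix,
    found by power iteration."""
    n = len(matrix)
    vec = [1.0 + 0.1 * i for i in range(n)]  # arbitrary starting vector
    for _ in range(iterations):
        new = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(n))
                for i in range(n))
    return value, vec


def pca_loadings(corr, n_factors):
    """Unrotated loadings: eigenvector element times sqrt(eigenvalue)."""
    n = len(corr)
    work = [row[:] for row in corr]
    eigenvalues, columns = [], []
    for _ in range(n_factors):
        value, vec = top_eigenpair(work)
        if vec[0] < 0:  # fix the arbitrary sign of the eigenvector
            vec = [-x for x in vec]
        eigenvalues.append(value)
        columns.append([x * math.sqrt(value) for x in vec])
        # Deflate: remove the component just extracted.
        for i in range(n):
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    loadings = [[columns[k][i] for k in range(n_factors)] for i in range(n)]
    return eigenvalues, loadings


# Hypothetical correlations: items 1-2 and items 3-4 form two clusters.
corr = [
    [1.0, 0.6, 0.2, 0.2],
    [0.6, 1.0, 0.2, 0.2],
    [0.2, 0.2, 1.0, 0.6],
    [0.2, 0.2, 0.6, 1.0],
]
eigenvalues, loadings = pca_loadings(corr, n_factors=2)
```

The resulting loading matrix has four rows (items) and two columns (factors),
as described above. In this example the second column separates items 1 and 2
from items 3 and 4 by sign, which is the kind of clustering the instrument
developer would then inspect for shared conceptual meaning.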


Relationship of Constructs to 
ly note that the develo 


€ case, the instrument's 
two types of problems could 
5 did not generate any concep- 
some meaningful constructs. While this is certainly a better situation, the
lack of correspondence of the constructs to the judgmental categories needs 
to be examined. Again, the ideal situation is a clear correspondence be- 
tween the content and construct validity studies. Falling short of this goal is a 
signal for problems in the future use of the instrument. 


Exploratory and Confirmatory Analyses. As we have described it, factor 
analysis is used to examine the relationships between the judgmentally de- 
veloped content categories and the empirically derived constructs. As such, 
we are actually testing hypotheses regarding the interrelationships among 
the items. In this regard we can consider the factor analysis to be confirma- 
tory rather than exploratory. In an exploratory analysis, one simply enters 
the items into the analysis and describes the resulting factors. A confirma- 
tory approach is more applicable to instrument development in that the 
developer examines the derived constructs in light of theoretical predictions 
that follow from the literature review and operational definitions of the 
targeted categories specified during the content validity process. While 
some factor analytic procedures (e.g., Jöreskog's maximum likelihood factor
analysis as described in Jöreskog and Sörbom, 1984) directly involve
hypothesized factor structures, our use of principal component analysis and 
common factor analysis in this volume will provide a sufficient vehicle for 
examining the existence of hypothesized interitem relationships. Readers 
should note that the new LISREL VI computer program, while difficult to 
run, has become a popular program for running confirmatory factor analy- 
ses. See Hocevar, Zimmer, and Strom (1984) for an example. 


Initial Decisions. Prior to running the factor analysis program, the
researcher has to decide what entries to insert in the diagonal of the
correlation matrix, the criterion for the number of roots to be selected, and
the method for rotating the derived factors. The researcher has a choice of
what values to insert in the diagonal of the correlation matrix prior to the
factoring. The particular choice made reflects how the researcher wishes to
deal with the variances of the items. That is, the total variance (V_T) of an
item theoretically consists of common (V_C) and unique (V_U) variance such
that V_T = V_C + V_U. Common variance is that portion of the total variance
that is shared with the other items; unique variance is that portion that is
uncorrelated or not shared with the other variables. The values selected for
the diagonal of R (i.e., the correlation matrix) are estimates of the variance
shared by the particular item and all the other items (i.e., an initial
communality estimate). We should appreciate that the choice of these diagonal
values consumed many years of research by psychometricians.
Essentially, the decision boils down to two choices, each of which implies a different
application of the mathematical model. The first choice is to leave the 1's in
the diagonal of the correlation matrix and carry out a principal-component
analysis.² Operationally, the 1's represent the amount of variance for each
variable which has been entered into the factor analysis.³ The use of 1's in
the diagonal means that all of the variance (i.e., common and unique) for
each variable has been entered into the analysis and no distinction is made
between common and unique variance in that they are operationally merged
together. The second approach consists of inserting squared multiple
correlations (SMCs) into the diagonal of the correlation matrix prior to the
factoring procedure. These SMCs represent an estimate of how much of the
variance of each item is shared with the set of remaining items (i.e., common
variance). The SMCs are known to be good initial estimates of the common
portion of variance in each item. Thus the procedure attempts to distinguish
between common and unique variance and is called a common-factor analysis.
Some researchers favor this approach since they feel that much of the
error variance (noise) in the analysis has been removed. Operationally, the
resulting factor structure from the principal component and common-factor
analyses will most often be quite similar. Developers should run both and
compare the results.
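
The two diagonal choices can be compared numerically. The sketch below is illustrative only: the correlation matrix is hypothetical and the helper names are assumptions introduced here. It uses the standard identity that the squared multiple correlation of item i with the remaining items equals 1 - 1/(R⁻¹)ᵢᵢ.

```python
import numpy as np

def squared_multiple_correlations(R):
    """SMC of each item with all remaining items: 1 - 1 / diag(inverse of R)."""
    return 1.0 - 1.0 / np.diag(np.linalg.inv(R))

def with_diagonal(R, values):
    """Return a copy of R whose diagonal has been replaced by `values`."""
    out = R.copy()
    np.fill_diagonal(out, values)
    return out

# Hypothetical 4-item correlation matrix (not data from the text).
R = np.array([
    [1.0, 0.6, 0.2, 0.2],
    [0.6, 1.0, 0.2, 0.2],
    [0.2, 0.2, 1.0, 0.6],
    [0.2, 0.2, 0.6, 1.0],
])

smc = squared_multiple_correlations(R)
R_principal = R                      # 1's in the diagonal: principal-component analysis
R_common = with_diagonal(R, smc)     # SMCs in the diagonal: common-factor analysis

print(np.round(smc, 3))
print(np.trace(R_principal), round(np.trace(R_common), 3))
```

Because every SMC is below 1, the trace of the matrix that is factored shrinks, which is why less total variance enters a common-factor analysis.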

The second decision pertains to the criterion for the number of factors to
be extracted from the solution. Two commonly followed practices are possible.
The first criterion employs a procedure developed by Guttman (1953)
and popularized by Kaiser (1958) and is known as "Kaiser's criterion."
When 1's have been inserted in the diagonal for a principal-component
analysis, all factors with eigenvalues (roots) greater than or equal to 1.0
(i.e., the unity criterion), are retained (Rummel, 1970). The rationale for
this criterion is that we know that the contribution of each item to the total
variance in the solution is 1 (i.e., its value in the diagonal of R). Also, we
will later illustrate how summing the squares of the factor loadings
(correlations) associated with each factor tells us how much variance the
factor accounts for in the solution. Thus, the unity criterion specifies that
only factors accounting for at least as much variance as a single item should
be retained (Comrey, […]). A second criterion is employed when SMCs have been
inserted into the diagonal of R (i.e., […]). […] principal-component analysis
employing the unity […]; instrument developers should try both procedures
a[…]lds a more meaningful solution in relation to the content categories
examined in the content validity phase.
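
As a rough sketch of the unity criterion, the function below counts the roots of a correlation matrix (with 1's in the diagonal) that are at least 1.0. The matrix is hypothetical and the function name is an assumption made for this illustration.

```python
import numpy as np

def n_factors_unity(R, cutoff=1.0):
    """Count eigenvalues of R that meet the unity criterion (>= cutoff)."""
    eigvals = np.linalg.eigvalsh(R)[::-1]     # roots, largest first
    return int(np.sum(eigvals >= cutoff)), eigvals

# Hypothetical 4-item correlation matrix with 1's in the diagonal.
R = np.array([
    [1.0, 0.6, 0.2, 0.2],
    [0.6, 1.0, 0.2, 0.2],
    [0.2, 0.2, 1.0, 0.6],
    [0.2, 0.2, 0.6, 1.0],
])

n_retained, eigvals = n_factors_unity(R)
print(n_retained, np.round(eigvals, 2))
# The roots sum to the trace of R (the number of items), so a root of 1.0
# corresponds to the variance contributed by a single item.
print(round(eigvals.sum(), 6))
```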

The final decision pertains to the choice of rotation procedures. (Rotation 
of the factor matrix will become clearer when the illustration is presented in 
a later section.) To understand the need for rotation of the factor matrix, we 
must realize that the underlying factor structure that provides the parsimo- 
nious description of the correlation matrix exists in a geometric space. That 
is, the clusters of items can be geometrically presented (i.e., each item is a 
point in a geometric space), but we must find a solution (i.e., a reference 
system) that allows us to view the clusters in a clear and meaningful manner. 
The factors in the initial factor matrix are actually axis systems (x and y axes) 
in the geometric sense and can be rotated to obtain a clearer picture of where 
the clusters of items (i.e., points in space) are in the geometric space. The 
closer we can get a cluster of items to a particular axis, the more the cluster 
will contribute to the naming of the factor. To understand this point, imag- 
ine yourself standing in the corner of a room. The three-dimensional room 
actually has an x, y, and z axis. Pretend there is a cluster of about 10 ping- 
pong balls (items) hanging from the ceiling forming the shape of a large 
ellipse. To name the cluster you need to assign it in the simplest two-dimen- 
sional case to the right wall (y axis) or the floor (x axis). Since the ellipse is 
not really near the wall or floor, you envision rotating the room so that the 
ellipse gets closer to the wall and further away from the floor. Note that you 
kept the wall and floor at right angles. In another situation, you might have 
two ellipses, neither of which is really near the wall or floor. The optimal
way to get the floor and wall nearer to the two ellipses may be to allow the
wall and floor to have less than a right angle (i.e., an oblique angle).

When we rotate a factor matrix in an attempt to locate clusters of items
nearer to an axis and keep the axes at right angles, we have performed an
orthogonal rotation called a varimax rotation (Kaiser, 1958). This procedure
keeps the axes (factors) independent or not related. An oblique rotation
allows the axes to collapse so that the derived factors are not independent
but correlated to some extent. It is recommended that instrument developers
run both varimax and oblique rotations to see which results in a more
meaningful solution. The principle of rotating the factor matrix will become
clearer when actual data are used to illustrate the technique in the next
section.

A final comment regarding varimax and oblique rotations is in order.
Some researchers have become confused after running a varimax rotation:
they operationally state that their derived factors are now uncorrelated. It is
the axis system that remains uncorrelated. When, in order to name a factor,
one selects those items correlating above, say, .40 with the factor, the
resulting scale scores formed by summing responses to the selected items are
not orthogonal, but usually have moderate correlations. Thus, the derived
factors are orthogonal, but the scale scores used to describe the factors are
not orthogonal.
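
To make the orthogonal case concrete, here is a minimal sketch of a varimax rotation. It is a commonly used SVD-based iteration for the raw varimax criterion, without the Kaiser normalization that SPSS applies, so it should not be expected to reproduce the output discussed in this chapter. The unrotated loadings are hypothetical, and the function name `varimax` is an assumption introduced here.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Rotate `loadings` (items x factors) orthogonally; return (rotated, T)."""
    n_items, n_factors = loadings.shape
    T = np.eye(n_factors)                      # start from no rotation
    previous = 0.0
    for _ in range(max_iter):
        rotated = loadings @ T
        col_ss = np.sum(rotated ** 2, axis=0)  # sum of squared loadings per factor
        gradient = loadings.T @ (rotated ** 3 - rotated * col_ss / n_items)
        U, s, Vt = np.linalg.svd(gradient)
        T = U @ Vt                             # nearest orthogonal matrix: axes stay at 90 degrees
        if previous > 0 and s.sum() < previous * (1 + tol):
            break                              # criterion has stopped improving
        previous = s.sum()
    return loadings @ T, T

# Hypothetical unrotated loadings: every item loads on the first factor.
unrotated = np.array([
    [0.70, 0.45],
    [0.75, 0.40],
    [0.65, 0.50],
    [0.60, -0.50],
    [0.70, -0.45],
    [0.55, -0.55],
])

rotated, T = varimax(unrotated)
print(np.round(rotated, 2))
print(np.round(T @ T.T, 6))    # identity matrix: the transformation is orthogonal
```

Because T is orthogonal, each item's row sum of squared loadings is unchanged by the rotation; only the way that variance is distributed across the factors changes.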

Another point of confusion to many researchers is the use of the term 
factor scores. In most cases developers select items that load highly on the 
factor and sum individuals' responses to the items to generate "scores on the
factor." It is an error to report these scores as factor scores; these scores
should always be referred to as scale scores or "[…] the factor." True factor
scores […] discussion of this point.


Computer output is often […] package will be discussed. […] items, which were
rated on a 5-point scale (strongly agree, 5; agree, 4; uncertain, 3;
disagree, 2; strongly disagree, 1). […] numbers indicate negative item stems,
which were reverse scored (i.e., 5 = 1; 4 = 2; 2 = 4; 1 = 5) so that high
scores would reflect positive attitudes. […] administered to 695 grade 11
students who rated six social studies teachers. Table 4-7 contains the factor
analysis output.⁴ For instructional purposes we will proceed through the
output and illustrate how the factor analysis information can be used to
examine construct validity. Note that the computer output page numbers appear
on the right side of the output and will be referenced in the sections that
follow.


Item Stems. Page 2 of the output presents the item stems and the SPSS 
procedure cards needed to run the principal-component analysis followed 
by varimax and oblique rotations. Note that negative item stems have been 
reverse scored by the recode statement on page 2 of the output. 
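
For readers working outside SPSS, the recode can be sketched as follows. The list of negative items is taken from the RECODE statement in the output (V7 to V10, V12, V13, V16 to V20, V22); the function name and the sample responses are hypothetical.

```python
# Items with negative stems, as listed on the RECODE statement.
NEGATIVE_ITEMS = [7, 8, 9, 10, 12, 13, 16, 17, 18, 19, 20, 22]

def reverse_score(response, scale_max=5):
    """Map 5->1, 4->2, 3->3, 2->4, 1->5 on a 5-point scale."""
    return scale_max + 1 - response

def recode_respondent(responses):
    """Reverse score the negative items in a dict of {item number: response}."""
    return {
        item: reverse_score(value) if item in NEGATIVE_ITEMS else value
        for item, value in responses.items()
    }

# Hypothetical responses to three items: one positive stem, two negative stems.
raw = {1: 4, 7: 1, 22: 5}
print(recode_respondent(raw))
```

After recoding, a high score on every item reflects a positive attitude, which is what allows the item responses to be summed later.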


Means and Standard Deviations. The item means (after reverse scoring of 
negative item stems) and standard deviations on page 3 of the output seem 
typical for such attitude items, and sufficient variability in responses appears 
to be present. If the means were near either extreme on the response scale 
and the standard deviations were low, the resulting correlations between 
items would tend to be low and hinder a meaningful factor analysis. 


Interitem Correlation. The interitem correlations (pages 3-4) indicate
that several items share moderate amounts of variation, but one cannot
identify clusters of items that appear to relate to each other while not
relating to other items not in the cluster. Clearly, such an eyeball approach
(interocular procedure) to factor analysis would be difficult, especially
with a large number of items.


Eigenvalues and the Variance Accounted For. We turn next to page 5,
which lists the eigenvalues and the percent of variance accounted for by each
derived factor in the solution. The eigenvalues or lambdas (λ) represent the
roots referred to earlier. Employing the unity criterion results in three
factors being extracted. The size of the root is directly related to the
importance of the derived factor since the sum of the roots will equal the
total amount of variance entered into the factor analysis. This amount of
variance is actually the sum of the diagonal entries in the correlation matrix
that was factored. In this example, the principal-component analysis uses 1's
in the diagonal of the original correlation matrix. These 1's represent the
amount of variance for each item that is entered into the analysis. Thus, the
sum of the diagonal entries in the correlation matrix, R (i.e., the trace of
R), equals 22 (the number of items) and represents the total variance in the
solution. (In a common-factor analysis squared multiple correlations are
inserted in the diagonal of R so that the trace of R is less than the number
of items.) Since the sum of the roots equals the sum of the diagonal entries
in R, the amount of variance accounted for by each factor prior to rotation is
obtained by dividing the root by the trace of R. Thus, in our example,
9.98 ÷ 22 = .454, indicating that 45.4% of the variance has been accounted
for by Factor I prior to rotation. Also, note that the output indicates that
the three-factor


Table 4-7. SPSSX Principal Components Analysis Followed by Varimax and
Oblique Rotations


17 JUN 85   FACTOR ANALYSIS AND RELIABILITY                          PAGE 2
17:39:04    UNIVERSITY OF CONNECTICUT   IBM 3081 D   MVS SP1.3.0

 4  RECODE V7 TO V10 V12 V13 V16 TO V20 V22 (5=1) (4=2) (3=3) (2=4)
           (1=5)
 5  VAR LABELS V1 MOTIVATES STUDENTS TO LEARN/
 6     V2 LIKES JOB/
 7     V3 IS FAIR IN DEALING WITH STUDENTS/
 8     V4 IS WILLING TO HELP STUDENTS INDIVID./
 9     V5 IS SUCCESSFUL IN GETTING POINT ACROSS/
10     V6 IS INTERESTED IN STUDENTS/
11     V7 LACKS ENTHUSIASM/
12     V8 TRIES TO COVER TOO MUCH IN A SHORT TIME/
13     V9 ASSIGNS TOO MUCH HOMEWORK/
14     V10 DOES NOT MAKE LEARNING FUN/
15     V11 IS GENERALLY CHEERFUL AND PLEASANT/
16     V12 DISCIPLINES TOO STRICTLY/
17     V13 IS NOT INTERESTING TO LISTEN TO/
18     V14 HAS A SENSE OF HUMOR/
19     V15 PRESENTS SUBJECT ... CAN BE UNDERSTOOD/
20     V16 IS TOO STRUCTURED/
21     V17 IS TOO BUSY TO SPEND TIME WITH STUDENTS/
22     V18 FAILS TO STIMULATE INTEREST IN SUBJECT/
23     V19 IS NOT INTERESTED IN HIS (HER) WORK/
24     V20 DOES NOT EVALUATE STUDENT WORK FAIRLY/
25     V21 LIKES STUDENTS/
26     V22 TESTS TOO FREQUENTLY/
27  FACTOR VARIABLES = V1 TO V22/
28     PRINT=DEFAULT UNIVARIATE CORRELATION/
29     ROTATION = VARIMAX/
30     ROTATION = OBLIQUE/
31     PLOT=EIGEN/

THERE ARE 83912 BYTES OF MEMORY AVAILABLE
THE LARGEST CONTIGUOUS AREA HAS 83912 BYTES
>NOTE 11284
>SINCE ANALYSIS SUBCOMMAND IS NOT USED, ALL VARIABLES ON THE VARIABLES
>SUBCOMMAND WILL BE USED FOR THE FIRST ANALYSIS
THIS FACTOR ANALYSIS REQUIRES 42756 (41.8K) BYTES

ANALYSIS NUMBER 1   LISTWISE DELETION OF CASES WITH MISSING VALUES

         MEAN     STD DEV   LABEL
V1     3.55396   1.13481   MOTIVATES STUDENTS TO LEARN
V2     4.03165   1.01807   LIKES JOB
V3     3.88633   1.11708   IS FAIR IN DEALING WITH STUDENTS
V4     3.92518   1.00008   IS WILLING TO HELP STUDENTS INDIVID.
V5     3.78417   1.10254   IS SUCCESSFUL IN GETTING POINT ACROSS
V6     3.91079   1.02734   IS INTERESTED IN STUDENTS
V7     4.01439   1.10008   LACKS ENTHUSIASM
V8     3.51223   1.28595   TRIES TO COVER TOO MUCH IN A SHORT TIME
V9     4.13237   1.18213   ASSIGNS TOO MUCH HOMEWORK
V10    3.59856   1.27320   DOES NOT MAKE LEARNING FUN
V11    3.80432   1.17820   IS GENERALLY CHEERFUL AND PLEASANT
V12    4.04604   1.06322   DISCIPLINES TOO STRICTLY
V13    3.68777   1.26994   IS NOT INTERESTING TO LISTEN TO
V14    3.98273   1.05592   HAS A SENSE OF HUMOR
V15    3.71511   1.12836   PRESENTS SUBJECT ... CAN BE UNDERSTOOD
V16    3.81151   1.13136   IS TOO STRUCTURED
V17    3.8274    1.09181   IS TOO BUSY TO SPEND TIME WITH STUDENTS
V18    3.67914   1.16471   FAILS TO STIMULATE INTEREST IN SUBJECT
V19    4.05180   1.08841   IS NOT INTERESTED IN HIS (HER) WORK
V20    3.76835   1.12202   DOES NOT EVALUATE STUDENT WORK FAIRLY
V21    3.96978   1.08192   LIKES STUDENTS
V22    4.10072   1.14695   TESTS TOO FREQUENTLY

NUMBER OF CASES = 695
CORRELATION MATRIX:
         V1     V2     V3     V4     V5     V6     V7   […]

17 JUN 85   FACTOR ANALYSIS AND RELIABILITY
17:39:17    UNIVERSITY OF CONNECTICUT   IBM 3081 D   MVS

FACTOR ANALYSIS

FACTOR TRANSFORMATION MATRIX:

             FACTOR 1   FACTOR 2   FACTOR 3
FACTOR 1      .60399     .57699     […]
FACTOR 2     -.76164     .62104     .18[…]
FACTOR 3     -.23473    -.53046     .814[…]

OBLIMIN ROTATION 2 FOR EXTRACTION 1 IN ANALYSIS 1 - KAISER
NORMALIZATION.

OBLIMIN CONVERGED IN 14 ITERATIONS.

PATTERN MATRIX:

        FACTOR 1   FACTOR 2   FACTOR 3
V1       .18694    -.74275    -.15721
V2       .84449     .03665    -.06275
V3       .32759    -.33635     .27312
V4       .57079    -.1914      .15875
V5       .24317    -.71271    -.15985
V6       .51869    -.16407     .32562
V7       .41545    -.33931     .09288
V8      -.20071    -.24171     .65933
V9       .05565    -.01529     .73990
V10     -.13731    -.77983     .13915
V11      .29739    -.48523     .09596
V12      .21139    -.08350     .60551
V13     -.18858    -.67379     .23129
V14      .19315    -.44626     .19119
V15      .01696    -.82687    -.04455
V16     -.01561     .03642     .74675
V17      .29447    -.14101     .39165
V18     -.09234    -.71410     .12397
V19      .56287    -.03533     .23872
V20      .18452    -.24426     .39022
V21      .50704    -.14815     .29849
V22      .16351     .09437     .74139

FACTOR ANALYSIS

STRUCTURE MATRIX:

        FACTOR 1   FACTOR 2   FACTOR 3
V1       .47133    -.74054     .33968
V2       .80249    -.32187     .25131
V3       .59293    -.64474     .59449
V4       .72531    -.55296     .49676
V5       .51249    -.73524     .34224
V6       .72443    -.59156     .62472
V7       .61070    -.58613     .45078
V8       .17362    -.52328     .71730
V9       .35623    -.46240     .77067
V10      .28194    -.79493     .52855
V11      .56198    -.67869     .49009
V12      .49052    -.52682     .73688
V13      .2177     -.71740     .54001
V14      .47731    -.64525     .52179
V15      .38531    -.80943     .43281
V16      .26355    -.38133     .71983
V17      .51563    -.50140     .58870
V18      .29019    -.74155     .49379
V19      .67404    -.43397     .48206
V20      .45331    -.55250     .60243
V21      .69459    -.55475     .58391
V22      .41349    -.40394     .75253

FACTOR CORRELATION MATRIX:

             FACTOR 1   FACTOR 2   FACTOR 3
FACTOR 1     1.00000
FACTOR 2     -.46684    1.00000
FACTOR 3      .39660    -.56917    1.00000

solution has accounted for 56.9% of the total variance.


Number of Factors. The number of factors derived by the solution is
operationally defined as the number of factors associated with eigenvalues or
roots greater than 1 (three factors in this case). While the default option in
most computer programs will use the unity root criterion, researchers can
also specify the number of fa[…]rce a solution hypothesized by the […] may
tend to artificially create the intended factor structure, […] factor number
(x axis) so that the shape of the resulting curve can be examined. The point
(factor number) at which […] straightens out is taken to indicate the
max[…]acted in the solution. Cattell suggests that […]ia. Page 6 of the
output illus[…] that the scree test and the […] three-factor solution,


Initial Communality. To th[…] eigenvalue display (page 5) appears
a listing of estimated communalities (EST COMMUNALITY) all of which
have values of 1. Communality […] amou[…] item accounted for by the […]


Unrotated Factor Matrix. […] matrix (FACTOR MATRIX) represents the derived
factor loading matrix […] has the same number of rows as items and columns as
the number of derived factors. The entries represent correlations between
each item and the derived factor so we can attempt to name the factors by
looking at items with the highest correlations. If we do this, we see that the
matrix is not interesting or useful since by design the first factor
(principal component) contains most of the variance (i.e., 45.4%). Since the
entries in the matrix are correlations, we could generate the sum of the
squared entries in Factor I and thus see that the value would equal the first
eigenvalue, 9.98.


Final Communalities. Since this unrotated matrix tells us little about the
content of the factors, we proceed in the output to the bottom of page 7
which contains a listing (COMMUNALITY) of the final communalities
resulting from the solution. These values are calculated as the sum of the
squared correlations in each row of the factor matrix on page 7 or the
varimax rotated matrix on page 8. For example, .5847 or 58.5% of the
variance in item 1 was accounted for by the factor solution.
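
The row-wise calculation can be checked by hand. The sketch below uses the two-decimal varimax loadings printed in table 4-8 for items 1, 10, and 15, so the results agree with the program's communalities only to about two decimals. Item 1's Factor III loading is below .10 and is not printed; entering it as 0.0 is an assumption. The last two lines repeat the variance arithmetic for the first root reported in the text (9.98 out of a trace of 22).

```python
import numpy as np

# Rounded varimax loadings from table 4-8 (rows: items 1, 10, 15).
# The 0.00 for item 1 on Factor III is an assumption (entry not printed).
loadings = np.array([
    [0.67, 0.37, 0.00],
    [0.73, 0.13, 0.33],
    [0.75, 0.25, 0.19],
])

# Communality of an item = sum of its squared loadings across the factors.
communalities = np.sum(loadings ** 2, axis=1)
print(np.round(communalities, 3))

# Proportion of total variance for the first unrotated factor: root / trace of R.
first_root, trace_R = 9.98, 22
print(round(first_root / trace_R, 3))
```

The first communality comes out within rounding error of the .5847 listed for item 1.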


Varimax Rotation. We now turn to a focal point in the analysis, the
varimax rotated factor matrix (ROTATED FACTOR MATRIX), usually
labeled F, on page 8. Prior to discussing the entries in the F matrix, we need
to note what the rotation process accomplishes. Recall that the unrotated
factor matrix on page 7 contained many high entries in Factor I and the other
factors contained few large entries. Since we name a factor by describing the
item content for those items correlating highly with the factor, it is evident
that the unrotated matrix would always lead to one overall factor determined
by most of the items and some other generally uninterpretable factors. Thus,
we rotate the factors in a geometric sense to try to uncover a clearer factor
structure. To understand the rotation concept, note that two factors at a time
can be plotted in a geometric space as is illustrated in figure 4-1. Items
loading .40 or above on either Factor I or Factor II have been plotted for the
unrotated and rotated matrices. In a varimax rotation the axis system remains
at right angles (90°), which implies that the factors are orthogonal or
independent. The correlations listed under Factors I and II in the factor
matrix are then the coordinates to be plotted. For example, in the varimax
rotated matrix item 1 has coordinates of .67 “over” and .37 “up” with axes I
and II, respectively. When we observe the factor structure exhibited in the
unrotated and rotated matrices, it is obvious why the unrotated matrix is of
little value to the instrument developers as all of the items cluster near
Factor I. When the correlations are plotted for the varimax rotated matrix, it
is clear that items 10, 13, 18, and 15 will contribute most to the name of
Factor I and items 2, 19, 21, 4, and 6 will contribute most to the naming of
Factor II. Based upon the plot in figure 4-1, it is not yet clear where items
1, 5, 14, 11, 7, and 3 will be assigned until the loadings and item content
are considered further. The actual naming of the factors will be discussed in
a later section.


Figure 4-1. Unrotated and varimax rotated factor matrix plots.


[…] the three-factor solution (i.e., 5[…] by summing the squares of the […]

Transformation Matrix. The next matrix (p. 9) is used in the rotation
process (TRANSFORMATION MATRIX) and can be ignored by the instrument
developer.

Oblique Rotation. The next section of the output contains the results of an
oblique rotation of the unrotated factor matrix. Prior to […] the results, we
need to present the rationale for conducting oblique rather than orthogonal
(varimax) rotations. In a varimax rotation the axis system is


Figure 4-2. Varimax and oblique rotation plots.


maintained at 90°. The axis system is simply rotated in all possible directions
so that the x (Factor I) and y (Factor II) axes each get close to a distinct
cluster of items. In this way, the entries listed under Factor I in the factor
loading matrix will have high loadings on Factor I and low loadings on Factor
II; for Factor II the reverse will be true. Refer again to figure 4-1 containing
the plot for Factors I and II resulting from the varimax rotation. Notice that,
if we envisioned an ellipse around the cluster of primary items defining each
factor, we would see that the ellipses are not located directly next to either
axis. That is, the ideal axis system for describing the relationship
between Factors I and II appears to be one with less than a 90° angle. In
other words, while the axis system representing the factors is orthogonal, the
clusters of items actually used to define the factor are not orthogonal. Since
this will be the case in most applications in education or psychology, it is
beneficial to also examine a factor solution that allows the axis system to be
less than 90°, the oblique rotation. In this rotation the axis system can
collapse to less than 90° so that each axis becomes closer to the cluster of
items defining the factor. Figure 4-2 repeats the varimax rotated plot and
includes the plot of a corresponding oblique rotation for the attitude toward
teacher data.⁵ Inspection of the plot indicates that it is now clearer that
items 1 and 5, and possibly 11 and 14, belong to Factor I. The final decision
would depend on the content of the items.
While the entries in a factor matrix created by an oblique rotation are […]
regression weights, they can be treated similarly to the correlations in the
[…] matrix for purposes of identifying the factor structure. […] it is no
longer appropriate to calculate sums of squares of the row or column entries,
since the entries are not correlations. If for some reason you wanted to
calculate the […] factors after rotation, you would have to refer b[…]
containing regression weights. […] the factors indicates that the three
fa[…] (see computer […]). […] items with the highest loadings indicate that
Factors I and […] from the varimax to the oblique rotation. This is […]
define the […] rotation. The "real" order is defined by the variance
accounted for after rotation as calculated earlier for the varimax rotation.
It is recommended that the labels for the oblique rotation simply be
switched to agree with the varimax order.
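
The distinction between regression weights and correlations in an oblique solution can be verified from the output itself. In an oblique solution the structure matrix (item-factor correlations) equals the pattern matrix (regression weights) multiplied by the factor correlation matrix. The sketch below applies this to items V1 and V2; the numbers are read from the PATTERN MATRIX, STRUCTURE MATRIX, and FACTOR CORRELATION MATRIX in table 4-7, with decimal points restored where the printed listing is unclear, which is an assumption of this illustration.

```python
import numpy as np

# Pattern weights for V1 and V2 (factor order as printed by the program).
pattern = np.array([
    [0.18694, -0.74275, -0.15721],   # V1
    [0.84449,  0.03665, -0.06275],   # V2
])

# Factor correlation matrix from the output.
phi = np.array([
    [ 1.00000, -0.46684,  0.39660],
    [-0.46684,  1.00000, -0.56917],
    [ 0.39660, -0.56917,  1.00000],
])

# Structure (item-factor correlations) = pattern x factor correlations.
structure = pattern @ phi
print(np.round(structure, 5))
```

The product reproduces the printed structure entries for V1 (.47133, -.74054, .33968) to within rounding, which is why sums of squares should be taken from correlations and not from the pattern weights.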


Factor Correlations. The final matrix of interest contains the correlations
between the derived factors and is labeled FACTOR CORRELATION MATRIX. […]
factors or axes […] for the GRATTS data are correlated .47. (Note that the
-.47 in the output […] reversed during interpretation.) […] underlying the
factor are not independent constructs. In the next chapter on reliability we
will see that factors correlated .30 to .40 or higher could be […] number of
items […] collapsing make[…] conceptu[…] content, higher alpha reliabilities
would result. One wonders why people tend to run varimax rotations when the
oblique rotation tells us if the factors are related while giving us a
varimax equivalent if the factor correlations are near zero. The reason for
the popularity of the varimax rotation is that it first appeared in 1958 when
Kaiser reported its use; reliable oblique rotation programs were not readily
available in most canned computer programs until the 1970s. There is no
problem with running both rotations and comparing the results, but one should
always run at least the oblique to see if the factors are correlated.


Table 4-8. Attitude Toward Teacher Factor Loading Matrix Varimax Rotationᵃ
(N = 695)

Item   Factor I   Factor II   Factor III
  1       67         37
  2                  80
  3       43         49          39
  4       31         66          27
  5       65         41
  6       31         64          41
  7       40         53          23
  8       36                     66
  9       21         24          70
 10       73         13          33
 11       51         46          26
 12       27         38          61
 13       65         …           …
 14       49         37          …
 15       75         25          19
 16       16         16          69
 17       28         43          44
 18       67         15          30
 19       18         62          30
 20       35         35          48
 21       29         62          38
 22       13         32          69

ᵃFor ease of interpretation only entries above .10 have been included and
decimals have been omitted.


Table 4-9. Attitude Toward Teacher Factor Pattern Matrix Oblique Rotationᵃ
(N = 695)

Item   Factor I   Factor II   Factor III
  1       74         19         -16
  2                  84
  3       34         33          27
  4       20         57          16
  5       71         24         -16
  6       16         52          33
  7       34         41
  8       24        -20          66
  9                              74
 10       78        -14          14
 11       49         30
 12                  21          60
 13       67        -19          23
 14       45         19          19
 15       83
 16                              75
 17       14         29          39
 18       71                     12
 19                  56          24
 20       24         18          39
 21       15         51          30
 22                  16          74

ᵃFor ease of interpretation only entries above .10 have been included and
decimals have been omitted. Note that the factor order for the oblique
rotation was changed to be consistent with the varimax solution and that
Factor I was reflected (i.e., signs were reversed).


Factor Interpretation. Now that we have reviewed the computer output,
we are ready to examine the factor structure that summarizes the information
in the correlation matrix. For the varimax rotation we interpret the
VARIMAX ROTATED FACTOR MATRIX and for the oblique rotation,
the FACTOR PATTERN matrix. The first step is to display the varimax and
oblique loading matrices. For ease of interpretation enter only those
loadings above .10 as illustrated in tables 4-8 and 4-9. […] matrixes)
underline the […] has a loading […], temporarily assign the item to both […]
defining each factor can assist in […] attitudes, positive item stems, and
positive loadings. For example, for a Likert 5-point agreement scale code a 5
for “strongly agree” to an item such as “This teacher motivates students to
learn.” You can then examine the signs of the loadings in the factor matrix to
see if the positive item stem was associated with a positive loading. If it is
not, you need to
ascertain why the positive stem has a negative loading. It may or may not be
a problem. If all of the positive item stems defining the factor (loadings
above .40) have negative loadings, you can simply reflect the factor (reverse
all of the + and - signs). This merely locates the factor in a different
quadrant geometrically; it does not change the magnitude of the relationships
among items or factors. Reflecting a factor in the oblique FACTOR
PATTERN matrix also necessitates reversing all of the signs in the factor's
row and column in the FACTOR CORRELATION matrix. Since
you have now located the factor in a different quadrant, the direction of the
relationships of the factor with other factors has also been changed. In most
cases, reflecting all appropriate factors will result in all positive relationships
in the FACTOR CORRELATION matrix. If within a particular factor you
find a few negative loadings (i.e., a bipolar factor), look at the item stems to
see if the content is stated in a negative direction (e.g., This teacher does not
make learning fun). If you reverse scored all of the negative item stems prior
to the factor analysis, you now have a problem. That is, your reverse scoring
based upon judgment did not agree with the direction of the item as
perceived by the respondent. A review of the item stem is in order. If you did
not reverse score the negative item stems, the negative loading merely
reflects the negative relationship you have created and it can simply be
ignored (i.e., changed to positive).
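
Reflecting a factor is a mechanical operation, sketched below under stated assumptions. The factor correlation matrix is the one printed in table 4-7; the three pattern rows (V1, V10, V15) are read from the same output with decimal points restored where the listing is unclear. The function name `reflect_factor` is hypothetical.

```python
import numpy as np

def reflect_factor(pattern, phi, j):
    """Reverse the signs of factor j in the pattern and factor correlation matrices."""
    pattern, phi = pattern.copy(), phi.copy()
    pattern[:, j] *= -1      # reverse every loading on factor j
    phi[j, :] *= -1          # reverse the factor's row ...
    phi[:, j] *= -1          # ... and column; the diagonal flips twice and stays 1
    return pattern, phi

pattern = np.array([
    [ 0.18694, -0.74275, -0.15721],   # V1
    [-0.13731, -0.77983,  0.13915],   # V10
    [ 0.01696, -0.82687, -0.04455],   # V15
])
phi = np.array([
    [ 1.00000, -0.46684,  0.39660],
    [-0.46684,  1.00000, -0.56917],
    [ 0.39660, -0.56917,  1.00000],
])

# The second factor printed by the program carries the negative loadings.
reflected_pattern, reflected_phi = reflect_factor(pattern, phi, j=1)
print(np.round(reflected_pattern, 2))
print(np.round(reflected_phi, 2))
```

After the reflection every factor correlation is positive and the magnitudes are unchanged, which matches the description of the factor simply being located in a different quadrant.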

After identifying the items that load .40 or greater on the factors, create
tables for the varimax and oblique rotations listing the item numbers, item
stems, and ranked loadings for each factor as illustrated in tables 4-10 and
4-11. For the oblique rotation you will also need to create a table of factor
correlations as illustrated in table 4-12. You are now ready to interpret the
factors. The items with the highest loadings share the most variance with the
factor you have derived. Your job is to ascertain just what concepts define
the factor. Review the item content to identify the underlying theme shared
by the items. Respondents tended to rate these items in a consistent manner
on the basis of some conceptual framework; and this consistency is what
created the intercorrelations among the items and contributed to the
development of the factor. What were the respondents perceiving when they
read the items?


Content and Construct Validity. The prior work in content validity will
assist in this review process. Consider the operational definitions created for
the affective characteristic as illustrated in tables 2-1 and 2-2. Recall that
the judgmental categories you built into the instrument were targeted to be
constructs measured by the instrument. That is, you and the content experts

Table 4-10. Principal-Component Analysis with Varimax Rotation:
Attitude Toward Teacher Items^a (N = 695)

  Item
  Number   Stem                                                      Loading

Factor I: Presentation of subject
    15     Presents the subject so that it can be understood.          .75
    10     Does not make learning fun.                                 .73
    18     Fails to stimulate interest in the subject matter.          .67
     1     Motivates students to learn.                                .67
    13     Is not interesting to listen to in class.                   .65
     5     Is successful in getting his (her) point across.            .65
    11     Is generally cheerful and pleasant.                         .51
    14     Has a sense of humor.                                       .49

Factor II: Interest in job and students
     2     Likes his (her) job.                                        .80
     4     Is willing to help students individually.                   .66
     6     Is interested in students.                                  .64
    19     Is not interested in his (her) work.                        .62
    21     Likes students.                                             .62
     7     Lacks enthusiasm.                                           .59
     3     Is fair in dealing with students.                           .49

Factor III: Teaching techniques
     9     Assigns too much homework.                                  .70
    22     Tests too frequently.                                       .69
    16     Is too structured.                                          .69
     8     Tries to cover too much material in too short a time.       .66
    12     Disciplines too strictly.                                   .61
    20     Does not evaluate student work fairly.                      .45
    17     Is too busy to spend extra time with students.              .44

^a Underlined item numbers reflect negative stems, which were reverse scored.


Table 4-11. Principal-Component Analysis with Oblique Rotation:
Attitude Toward Teacher Items^a (N = 695)

  Item
  Number   Stem                                                      Loading

Factor I: Presentation of subject
    15     Presents the subject so that it can be understood.          .83
    10     Does not make learning fun.                                 .78
     1     Motivates students to learn.                                .74
    18     Fails to stimulate interest in subject matter.              .71
     5     Is successful in getting his (her) point across.            .71
    13     Is not interesting to listen to in class.                   .67
    11     Is generally cheerful and pleasant.                         .49
    14     Has a sense of humor.                                       .45

Factor II: Interest in job and students
     2     Likes his (her) job.                                        .84
     4     Is willing to help students individually.                   .57
    19     Is not interested in his (her) work.                        .56
     6     Is interested in students.                                  .52
    21     Likes students.                                             .51
     7     Lacks enthusiasm.                                           .41

Factor III: Teaching techniques
    16     Is too structured.                                          .75
    22     Tests too frequently.                                       .74
     9     Assigns too much homework.                                  .74
     8     Tries to cover too much material in too short a time.       .66
    12     Disciplines too strictly.                                   .60

^a Underlined item numbers reflect negative stems, which were reverse scored.
Table 4-12. Attitude Toward Teacher Factor Intercorrelation Matrix: Oblique
Rotation (N = 696)

               Factor I    Factor II    Factor III
Factor I         1.00         .47          .40
Factor II                    1.00          .87
Factor III                                1.00


specified the universe of content to be measured. Now the items have been
rated by the intended users of the [...] have been used to develop the
empirical relationships among the items. These empirical relationships are
[...]tion: "To what extent d[...] categories and empirical fac[...]
interpretation of the derived factors. If the [...] By merging some actual
[...] describe the perceptions or attri[...]y on the factor. For Factor I in
table 4-10 the description could be:

Factor I was called Presentation of Subject as items defining [...] the
delivery of the lecture as well as the impact of the [...] on students.
Teachers rated highly on the factor would be perceived [...] to give clear,
interesting lectures which provided stimulation and motivation for learning.
If an oblique rotation was used, the final step would be to describe the
intercorrelations among the factors displayed in table 4-12. Note that for
these data the factors are correlated above .40 so the possibility of collapsing
the factors can be considered. As we noted earlier, the decision to collapse
any of the factors is based upon the level of reliability of the uncollapsed
factors and the conceptual meaningfulness of such collapsing. If the factors
are reliable and clearly meaningful, collapsing is optional. If the alpha
reliability (see chapter 5) of the factors is low and collapsing makes conceptual
sense, you may try to collapse them to obtain a factor with a larger number
of items and thus higher alpha reliability.


Scale-Level Factor Analysis. In the previous section the use of an item-
level factor analysis was illustrated for examining construct validity. It is also
possible to conduct a factor analysis at the scale level. That is, responses are
summed across items defining the respective scales. The scale-level intercor-
relation matrix is then submitted to a factor analysis to examine the exis-
tence of more global underlying factors. This type of information can add
greatly to the interpretation of the instrument, especially when the instru-
ment attempts to assess several somewhat-related constructs.

Studies using the Tennessee Self-Concept Scale (Fitts, 1965) illustrate
such scale-level factor analyses. The TSS is a 100-item instrument with 90
items contributing to a 3 x 5 self-concept classification scheme. The three
levels of the first dimension represent the individual's internal frame of
reference; the five levels of the second dimension represent an external
frame of reference. A potential psychometric problem with the instrument is
that the scoring procedure employs a 3 x 5 grid, which results in each item
contributing to both a row (internal) and column (external) dimension of
self-concept. This situation can lead to spuriously high correlations due to
the item-scale overlap. In addition to these eight scale scores, four other
scale scores are generated through various combinations of the items, which
results in a total of 12 scale scores.

The construct validity of the 12 scales was examined by Rentz and White
(1967) and Gable, LaSalle, and Cook (1973). In each study the scale-
level intercorrelation matrix was factored and two relatively independent
dimensions of self-concept, similar to self-esteem and conflict-integration,
accounted for the scale interrelationships. Thus, the scale-level factor analysis
contributed information that would lead to finer interpretations of the TSS
scores.
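
The first two steps of a scale-level analysis (summing items into scale scores and intercorrelating the scales) might be sketched as follows. The responses, the scale names, and the item-to-scale assignments are all hypothetical, and the resulting matrix would still have to be submitted to a factor analysis program; that step is not shown here.

```python
from math import sqrt


def pearson(x, y):
    """Simple Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def scale_scores(responses, scale_items):
    """Sum each respondent's item responses within each scale."""
    return {
        name: [sum(person[item] for item in items) for person in responses]
        for name, items in scale_items.items()
    }


def intercorrelations(scores):
    """Scale-level intercorrelation matrix as a dict keyed by (scale, scale)."""
    names = list(scores)
    return {(a, b): pearson(scores[a], scores[b]) for a in names for b in names}


# Hypothetical data: five respondents, four items, two scales.
responses = [
    {1: 5, 2: 4, 3: 2, 4: 1},
    {1: 4, 2: 4, 3: 3, 4: 2},
    {1: 2, 2: 1, 3: 4, 4: 5},
    {1: 3, 2: 3, 3: 3, 4: 3},
    {1: 1, 2: 2, 3: 5, 4: 4},
]
scale_items = {"Scale A": [1, 2], "Scale B": [3, 4]}

scores = scale_scores(responses, scale_items)
r_matrix = intercorrelations(scores)
```

Note that if one item were assigned to two scales, as in the 3 x 5 grid described above, the two scale scores would share that item's variance and their correlation would be inflated accordingly.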


Factoring Several Instruments. The construct validity of an instrument can
also be examined through a scale-level factor analysis that also includes
scales from other carefully selected instruments or variables. The intent of
the analysis is to ascertain if the scales load on factors in a manner consistent
with theoretical expectations.

This technique can be illustrated by considering the instruments and
variables described earlier in the chapter regarding the use of correlations to
examine construct validity. Complete sets of data for 503 high school juniors
were obtained for the target instrument, the Work Values Inventory (WVI),
as well as selected scales from the following instruments: Edwards Personal
Preference Schedule (EPPS), Kuder Preference Record (KUD), Survey of
Interpersonal Values (SIV), Study of Values (SOV), and the Differential
Aptitude Test (DAT). In addition, data were obtained for school grades,
social class, and sex (Gable, 1970). Fourteen of the WVI scales were inter-
correlated with the other 36 scales, resulting in a 50 x 50 intercorrelation
matrix. The purpose of the analysis was to provide a clearer interpretation of
the factor structure of the WVI by seeing if the targeted scales from the other
measures would load on factors with the WVI scales as predicted on the basis
of a literature review.


Table 4-13 contains the loadin[...] (i.e., a type of factor analysis) [...]
was found to be supportive of t[...]. Note that the predicted [...] values
and student achievement or aptitude was supported in that Factor IV was
defined solely by the DAT scales and Factor V by student achievement and
DAT scores.
Obviously, the crucial aspect of such an analysis is the selection of instru-
ments based upon the literature review since the factor analysis is confirma-
tory in nature. Another key aspect is that obtaining complete sets of data
across several instruments takes testing time and good [...] and students.
Unmotivated respondents and missing data will [...] ruin the analysis. If the
data can be collected, the contribution to examining construct validity will
be worth the effort.

In summary, this section has described and illustrated the use of factor 
analysis to examine construct validity. It was emphasized that the empirical- 
ly derived constructs should correspond with the judgmentally derived con- 
tent categories developed during the content validity phase of developing 
the instrument. Discrepancies between the derived constructs and content 
categories could be quite serious and indicate need for further
developmental work.


Test Bias. The study of test bias for cognitive measures has received con-
siderable attention (see Cole, 1981; Berk, 1982; Hambleton & Swami-
nathan, 1984). Techniques available for detecting test bias clearly frame the
investigations in the validity domain. This is consistent with Cronbach's
(1971) statement that "validation examines the soundness of all the inter-
pretations of a test" (p. 433).

For affective measures, therefore, it is appropriate to study bias in the
internal structure of the affective instrument (i.e., construct bias). Accord-
ing to Reynolds:

Bias exists in regard to construct validity when a test is shown to measure different
hypothetical traits (psychological constructs) for one group than another or to
measure the same trait but with differing degrees of accuracy. (Reynolds, 1982a,
p. 194)


The important role of factor analysis in studying the internal structure of
an affective measure should be clear. To date, though, the factor analytic
approaches used to study bias in internal test structure (i.e., compare factor
structures) have focused upon cognitive measures (Reynolds, 1982b). Jen-
sen (1980) and Cole (1981) have reviewed these approaches and concluded
that for the cognitive measures studied, similar factor structures tended to
be found for Blacks and Whites and low and high socioeconomic groups.
For affective measures it appears that researchers have not adequately
addressed the issue of test bias. In a study reported by Benson (1982) the
response patterns of Hispanic, White, and Black grade 8 students were
studied for a 32-item self-concept and social attitudes instrument. A con-
firmatory factor analysis technique using the LISREL IV program was em-
ployed to statistically compare the groups on the factor structure identified
for Hispanics by a classical factor analysis and varimax rotation. After iden-
tifying the factor structure for the Hispanic group, three questions were
examined across the three groups: the number of factors derived for each
group, correlations between the factors, and the errors in measurement.
The data clearly supported the existence of bias and identified items for
further review. It appears that the confirmatory approach provided by the
LISREL procedure is a promising statistical procedure for simultaneously
investigating differential response patterns across different subgroups.
Readers are also encouraged to review Mayberry's (1984) study of item bias
in attitude measurement, where biases related to content, format, and
meaning were examined.


Known Groups 


Construct validity can also be examined by showing that the instrument
successfully distinguishes between a group of people who are known to
possess the characteristic and a group (or groups) who do not have high
levels of the trait. The key to this analysis is that the existence of the trait in
[...] documented with some external criterion. The [...] the new instrument
and an analysis such as the t-test is used to test whether the instrument has
successfully described the known differences between the groups.
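
A known-groups comparison of this kind can be illustrated with a simple independent-samples t statistic. The sketch below uses a pooled-variance t-test and invented scale scores for two small groups; the group labels and values are hypothetical and serve only to show the computation.

```python
from math import sqrt


def pooled_t(group_a, group_b):
    """Independent-samples t statistic (pooled variance) and its df."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = sum(group_a) / n_a, sum(group_b) / n_b
    ss_a = sum((x - mean_a) ** 2 for x in group_a)
    ss_b = sum((x - mean_b) ** 2 for x in group_b)
    df = n_a + n_b - 2
    pooled_var = (ss_a + ss_b) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, df


# Hypothetical instrument scores for groups formed on an external criterion.
known_high = [32, 35, 30, 36, 34]
known_low = [24, 27, 22, 26, 25]

t_value, df = pooled_t(known_high, known_low)
print(round(t_value, 2), df)
```

A large t in the expected direction is consistent with the claim that the instrument separates the groups; the t value would be referred to a t table with the reported degrees of freedom.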


Two general examples will clarify the known-groups technique. In the
early development of some paper and pencil anxiety measures, groups of
high anxious and low anxious people were formed. The participants' palms
were wired for the skin sweating response (i.e., galvanic skin response) and
then an anxiety-provoking situation was presented so that physical levels of
anxiety [...] high and low anxious groups were then
administered the anxiety instrument. If the resulting [...] were
[...] could argue that the instrument mea[...]
[...] have used a clinical team [...] having a certain personality
[...] the same trait was then [...]

Criterion-Related Validity 


Criterion-related validity addresses the question "What is the relationship
between scores on the instrument and some external criterion that provides a
more direct measure of the targeted characteristic?" Depending upon the
time frame employed in the validity inference, two types of criterion-related
validity studies are possible: concurrent and predictive.


Evidence of Concurrent Validity 


In studying concurrent validity, the instrument is administered and then at 
approximately the same time data are obtained from another instrument or 
through observation techniques (i.e., the criterion) so that the relationship 
between the instrument and the criterion can be established and analyzed 
with respect to the theory underlying the instrument and the criterion. The 
most frequent statistic employed is the simple correlation coefficient. 

For example, we could administer the Gable-Roberts Attitude Toward
Teacher Scale to a class of students as part of a teacher evaluation project.
Concurrently, the principal could observe several classes and rate the
teacher on the same scale. The validity of the student perceptions as an
accurate assessment of the teacher's behaviors could be ascertained by cor-
relating the student and the principal ratings on the three GRATTS attitude
scales (Presentation of Subject, Interest in Job and Students, and Teaching
Techniques). If the correlation was near .90, we could conclude that the
GRATTS possesses criterion-related validity for estimating the principal
ratings. If this was the case, we could substitute the student ratings, gathered
in a brief 10-minute scale administration, for the principal ratings, which
entailed several time-consuming observations of the class.
Readers may feel that this is really an example of construct validity. It is
true that the data analysis is the same, but note that the research question is
different. Whereas in an analysis of construct validity we are studying the
existence of underlying concepts or constructs to explain variation among
items or scales, the criterion-related validity study addresses the viability of
substituting a quick student rating procedure for a time-consuming principal
observation procedure.

Another example from the cognitive domain will further illustrate the 
nature of concurrent validity. A test manual from an IQ measure contains a
correlation of .90 between the long and short form administered concurrent-
ly to a sample of students. The criterion-related (concurrent) validity of the
short form is then supported as an estimate of the scores from the long form
so that practitioners can confidently use the short form in the midst of testing
time constraints. 
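
The concurrent validity coefficient in such an example is simply the correlation between the two sets of scores gathered at about the same time. A minimal sketch, using invented long-form and short-form scores for eight students (the values and variable names are hypothetical), follows.

```python
from math import sqrt


def pearson(x, y):
    """Simple Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for the same eight students on both forms.
long_form = [98, 105, 110, 121, 93, 130, 102, 115]
short_form = [100, 103, 112, 118, 95, 127, 101, 117]

concurrent_r = pearson(long_form, short_form)
print(round(concurrent_r, 2))
```

With real data the sample would of course be far larger, and the coefficient would be interpreted in light of the theory underlying both the instrument and the criterion.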
Evidence of Predictive Validity


In other situations the instrument may be designed to predict a future char- 
acteristic or behavior. The instrument is administered and then at some later 
time measures are obtained on some external criterion. Correlational analy- 
ses (simple correlations, regression analyses, or discriminant function analy- 
sis) are most often employed as the primary statistical technique. If the 
prediction is successful, future users of the instrument could administer the 
instrument and then estimate the student’s status regarding the criterion 
variable. 

An example of establishing the criterion-related (predictive) validity for 
an attitude scale is found in a study by McCook (1973). A 36-item attitude 
scale entitled the McCook Alternative Programming Scale (MAPS) was 
developed as a means of predicting potential school dropouts (i.e., the
criterion). Responses from 108 grade 12 nondropouts and 39 actual school 
dropouts were submitted to a factor analysis (construct validity) which re- 
sulted in 10 attitude scales (e.g., Parental Involvement in School, Attitude 
Toward Education, Importance of Peer Group Relations, and Stimulation 
of the School Environments). A discriminant-function analysis was then run 
to ascertain to what extent the attitude scales (predictors) could successfully 
classify the dropouts and nondropouts into their respective groups (i.e., the 
criterion). Seventy-four percent (74%) of the nondropouts and 94% of the 
dropouts were successfully assigned. Thus, the criterion-related validity of 
the MAPS was supported. 
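
The classification results reported above can be combined into an overall hit rate. The sketch below back-calculates the number of correct assignments from the rounded percentages in the text, so the counts are approximations rather than figures taken from McCook (1973); the comparison with the rate obtained by assigning everyone to the larger group is our addition.

```python
# Group sizes and correct-classification rates as reported in the text.
groups = {
    "nondropouts": {"n": 108, "pct_correct": 0.74},
    "dropouts": {"n": 39, "pct_correct": 0.94},
}

# Approximate counts of correct assignments (assumption: rounded back
# from the reported percentages).
correct = {name: round(g["n"] * g["pct_correct"]) for name, g in groups.items()}

total_n = sum(g["n"] for g in groups.values())
overall_hit_rate = sum(correct.values()) / total_n

# Hit rate from simply assigning every student to the larger group.
largest_group_rate = max(g["n"] for g in groups.values()) / total_n

print(correct, round(overall_hit_rate, 3), round(largest_group_rate, 3))
```

Because the groups differ greatly in size, reporting the hit rate separately for each group, as the study did, is more informative than the overall rate alone.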

Criterion-related validity studies are essential for cognitive instruments 
used in the area of personnel selection (see American Psychological Asso- 
ciation, 1975). Since these studies rarely include affective measures we will 
not detail them in this volume. We will, however, refer to them to further 
illustrate criterion-related validity. In the personnel field it is common for 
the selection measures used to assess aspects of the actual job content. In- 
cluded are tasks, activities, or responsibilities needed to successfully carry 
out the particular job. The assessment instrument is administered as a selec- 
tion measure on the logic that the people with the higher scores would per- 
form better in the actual job. The criterion-related validity study necessary 
to support use of the instrument in this manner could be carried out as 
follows: A pool of applicants is administered the selection measure. Ideal- 
ly, most of the applicants would, in this validity study, be admitted to the 
job class so that at a later time their peers and supervisors could rate their 
on-the-job performance (i.e., the criterion). The performance ratings would 
then be correlated with scores on the selection instrument and high correla- 
tions would be supportive of the criterion-related (predictive) validity of the 
instrument. In practice, it is often difficult to admit most of those tested to
the job class. In this situation the selection measure is administered to
current employees in the job class so that their scores can be correlated with
on-the-job performance ratings. Naturally, those already on the job are a
[...] group, which tends to restrict the range of scores on both
[...]. Thus, although these studies are used as predictive studies, they
are not always accepted as good evidence of criterion-related validity. Since
proper criterion-related studies are difficult to carry out in the personnel 
area, the content validation study, which carefully specifies the correspon- 
dence of the job specifications with the items and domains on the instru- 
ment, takes precedence in the validation process. 
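
The effect of restricting the range of scores can be demonstrated with a small simulation. This is a sketch under stated assumptions: the population correlation of .60, the sample size, the 30% selection ratio, and the normally distributed scores are all hypothetical choices of ours, not values from any study cited here.

```python
import random
from math import sqrt


def pearson(x, y):
    """Simple Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(12345)   # fixed seed so the run is repeatable
population_r = 0.60          # hypothetical validity in the full applicant pool
n_applicants = 5000

# Selection scores and later job performance, correlated about .60.
selection = [rng.gauss(0, 1) for _ in range(n_applicants)]
performance = [
    population_r * s + sqrt(1 - population_r ** 2) * rng.gauss(0, 1)
    for s in selection
]

r_full = pearson(selection, performance)

# Keep only the top 30% on the selection measure (those "already on the job").
cutoff = sorted(selection)[n_applicants * 7 // 10]
hired = [(s, p) for s, p in zip(selection, performance) if s >= cutoff]
r_restricted = pearson([s for s, _ in hired], [p for _, p in hired])

print(round(r_full, 2), round(r_restricted, 2))
```

The correlation computed within the selected group is noticeably lower than the correlation in the full applicant pool, which is why validity coefficients based only on current employees tend to understate the predictive validity of the selection measure.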
A criterion-related validity study from the affective area of interest clear- 
ly illustrates the predictive nature of the validity question. One measure of a 
successful use of an interest inventory is its ability to predict future actual job 
choices. To support the criterion-related (predictive) validity of the interest 
measures, researchers typically administer the interest inventory in high 
school and then follow the students for a few years to find out the actual 
areas of their job choice. For example, Silver and Barnett (1970) reported 
evidence for the criterion-related validity of the Minnesota Vocational
Interest Inventory (MVII) by examining job choices (e.g., building trades,
electrical, and machine shop) made by high school graduates who completed
the MVII in grade 9. Similarly, Campbell (1973) has presented supportive
criterion-related (predictive) validity for the Strong Vocational Interest
Blank for Men (SVIB) indicating that approximately 75% of various sam-
ples of college graduates wound up in jobs that were compatible with their
high school (grade 12) SVIB profiles.


Summary 


In this chapter we have examined the issue of the validity of instruments for
measuring affective characteristics. In general terms, the validity issue per-
tained to gathering judgmental and empirical evidence regarding what affec-
tive characteristic is assessed by the instrument. The interpretations of data
were based upon inferences made from the operational definitions (item
stems) derived from the conceptual (construct) definitions underlying the
instrument.

Three types of validity, as well as appropriate judgmental and empirical
evidence, were described which focused on evidence that would contribute to
finer interpretations of the characteristics measured (i.e., that would rule
out alternate interpretations). Each type of validity addressed a particular
question. Studies of content validity asked "To what extent do the items on
the instrument adequately sample from the intended universe of content?"
Studies of construct validity asked "To what extent do certain explanatory
concepts (constructs) explain covariation in the responses to the instrument
items?" Finally, studies of criterion-related validity asked "What is the rela-
tionship between scores on the instrument and some external criterion that
provides a more direct measure of the targeted characteristic?"

It should be clear that the three types of validity are not independent 
considerations. All three types share the need for a clearly stated theoretical
rationale underlying the targeted affective characteristic—it is on the basis 
of this rationale that validity is examined and the interpretations of the data 
from the instrument are refined. The types of validity differ in the nature of 
the validity question asked. Two researchers may both employ correlations 
in their study of validity, but one may be examining construct validity and 
the other criterion-related validity. 

All developers of affective instruments need to place great emphasis on 
establishing content validity. The intended use of the instrument will dictate 
the type and amount of construct and criterion-related validity evidence 
necessary. For example, measures of self-concept, attitude, and values will
most likely emphasize construct validity; interest inventories will emphasize
criterion-related validity.


Notes 


1. Later we will note that an oblique rotation results in a factor pattern matrix containing
regression weights.

2. To be technically correct we should use the term principal-component analysis when 1's are
in the diagonal and indicate that we have derived components instead of factors. In this volume,
we will loosely use the term factor analysis instead of component analysis.

3. Readers should recall that the 1's in the diagonal of a correlation matrix represent the
variances for each variable. While the items may initially have different variances, their covar-
iances are normalized in generating the correlation matrix so that the items all have equal
[...] variables. Actually, all of the it[...] the diagonal of R indicate th[...]gth of 1 and now
proceed to examine the angles between the [...] vectors on the basis of the correlations
between the [...] (See the discussion of rotating factors in the illustration provided later in
this section.)

4. Include "FORMAT=SORT/" on the rotation procedure cards to obtain sorted items by
ranked loadings.

5. Readers should note that plotting the data for an oblique rotation is a little tricky in that the
coordinates are located as parallel projections from the point to the line. That is, the coordinate
for the x axis is obtained by dropping a line parallel to the y axis from the point to the x axis,
which results in a regression weight. In a varimax rotation perpendicular or parallel projections
are identical and result in correlations. In an oblique rotation, perpendicular projections yield
correlations in the FACTOR STRUCTURE matrix; these should be ignored.

6. The correlations represent the cosine of the angle between the two factors. Noting that the
cosine curve can be used to calculate angles for various correlations, readers may wish to
estimate the actual angle between axes. For example, when the correlation is zero, the angle is
90° (varimax). In our example, the correlation between Factors I and II is .47 so the angle
between the axes in figure 4-2 is estimated to be approximately 62°.

[Figure: cosine curve, with the angle axis marked at 90, 180, 270, and 360 degrees]
7. It is possible to assign one item to two factors but the scoring and reliability computer run
can be confusing.

8. See Coletta and Gable (1975) for an example of how content validity information assisted
in finer interpretations of the derived factors.

9. Only selected scales from the KUD, SIV, and SOV were used for two reasons. First, the
scales were theoretically important to the study; and second, at least one scale from each
measure was deleted because the measures were ipsative in nature (i.e., all respondents obtain
the same total score). Chapter 3 discussed the implications of ipsative instruments.
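
The angle estimate described in note 6 can be checked directly, since the correlation between two factors is the cosine of the angle between their axes. The short sketch below uses only the Python standard library; the function name is ours.

```python
from math import acos, degrees


def angle_between_factors(correlation):
    """Angle (in degrees) whose cosine equals the factor correlation."""
    return degrees(acos(correlation))


# Uncorrelated (varimax) factors are at right angles.
orthogonal_angle = angle_between_factors(0.0)

# Factors I and II in table 4-12 correlate .47.
oblique_angle = angle_between_factors(0.47)

print(round(orthogonal_angle, 1), round(oblique_angle, 1))
```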
5 THE RELIABILITY OF AFFECTIVE
INSTRUMENTS


In the previous chapter we examined the validity question, “Does the instru- 
ment measure what it is supposed to measure?” We now turn to the issue of 
reliability, which is concerned with the question “Does the instrument 
provide us with an accurate assessment of the affective characteristics?" By
"accurate assessment" we mean scores that are internally consistent upon
one administration of the instrument as well as stable over time given two 
administrations (see Stanley, 1971). In this chapter we will explore the 
measurement theory underlying reliability and the evidence needed to sup- 
port its internal consistency and stability. Following this, we will summarize 
the factors affecting the level of reliability and discuss the relationship be- 
tween reliability and validity. The chapter will conclude with a presentation 
of an analysis of SPSS computer output pertaining to the reliability of the 
Gable-Roberts Attitude Toward Teacher Scale. 


Reliability Theory 


Any time we administer an affective instrument we obtain scores on the 
scales (item clusters) contained in the instrument. Recall that these scores 
are most often the sums of the responses to a set of items that have been
written to operationally define the affective characteristic. Inferences are
then made from these scores back to the original conceptual definition of the
characteristic. Obviously, these inferences are only as good as the amount of
true-score variance in the observed score. That is, any individual's observed
total score ($X_{TOT}$) actually consists of a true score component ($X_{TRUE}$) and
an error component ($X_{E}$) such that

    $X_{TOT} = X_{TRUE} + X_{E}$    (5.1)


The first part, $X_{TRUE}$, reflects the portion of the individual's total score that
is associated with a true reflection of the affective characteristic. That is,
indeed, a hypothetical component since it reflects the individual's score
obtained by a perfect measurement instrument under perfect conditions.
Kerlinger (1973) notes that this score can be considered the mean of the
individual's scores under repeated administrations assuming no learning
took place. Quite obviously, we never see an individual's true score; we can
only estimate the portion of the total score that is true. At the same time,
each individual's total score includes an error component ($X_{E}$), which
reflects the portion of the total score that we cannot explain. The raw score
formula can also be written in terms of variance components as

    $V_{TOT} = V_{TRUE} + V_{E}$    (5.2)


While it would be nice to be able to measure an affective characteristic
with precise measurement tools, we know that the instruments we use lead 
to some errors in measurement. In assessing the reliability of an instrument 
we attempt to estimate the amount of error in the scores so that we can 
estimate the amount of true variance in the total score—the less error 
involved, the more reliable the measurement. 

Following Guilford's (1954) traditional conception of reliability, Kerlin-
ger (1973, p. 446) lists two definitions of reliability as follows:

1. Reliability is the proportion of the "true" variance to the total variance
of the data yielded by a measuring instrument:

    $rel = \frac{V_{TRUE}}{V_{TOT}}$    (5.3)

2. Reliability is the proportion of error variance to the total obtained
variance yielded by a measuring instrument subtracted from 1.00, the
index 1.00 indicating perfect reliability:

    $rel = 1 - \frac{V_{E}}{V_{TOT}}$    (5.4)

The first formula is certainly a theoretical formula since we never actually
measure the true variance. The second formula, though, is practical in that it 
suggests that the empirical route for estimating the reliability of an instru- 
ment is to separate the amount of error variance from the total variance. 
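
A small numerical illustration may help show that the two definitions are equivalent whenever the variance components add as in equation 5.2. The variance values below are hypothetical, chosen only so that the arithmetic is easy to follow, and the function names are ours.

```python
def reliability_from_true(v_true, v_tot):
    """Equation 5.3: proportion of true variance to total variance."""
    return v_true / v_tot


def reliability_from_error(v_error, v_tot):
    """Equation 5.4: 1.00 minus the proportion of error variance."""
    return 1.0 - v_error / v_tot


# Hypothetical variance components for one scale.
v_true = 9.0
v_error = 1.0
v_tot = v_true + v_error   # equation 5.2

rel_53 = reliability_from_true(v_true, v_tot)
rel_54 = reliability_from_error(v_error, v_tot)

print(rel_53, rel_54)
```

In practice only the total variance and an estimate of the error variance are available, which is why the second form is the one that leads to workable estimation procedures.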

Nunnally (1978) presents a description of this measurement error in the 
context of a domain sampling model. This model is consistent with our 
description of an instrument in chapter 2 as a sampling of items from a 
well-defined domain or universe of content. In fact, defending this sampling 
was the basis for establishing content validity. 

According to Nunnally (1978), an individual's true score is the score the
person would hypothetically obtain over the entire domain of items. Thus,
reliability can be conceptualized as the correlation between an individual's
scores on the sample of items and their true scores. For any one item the
correlation between the item and the true score equals the square root of
the average intercorrelation of item 1 with all of the other items (j) in the
domain as follows (Nunnally, 1978, p. 198):

    $r_{1t} = \sqrt{\bar{r}_{1j}}$    (5.5)

Since the average correlation of an item with all other items in the domain is
the reliability coefficient, then the square root of the reliability coefficient
equals the correlation of the item with true scores in the domain (i.e., the
sum of all items in the domain) such that

    $r_{1t} = \sqrt{rel}$    (5.6)

By squaring both sides of equation 5.6 we can also state that the reliability
coefficient equals the square of the correlation between an item and true
scores such that

    $r_{1t}^{2} = rel$    (5.7)

Recalling that squared correlations indicate the proportion of shared
variance, we can extend this to a cluster of items defining a scale on an
instrument and state that, conceptually, the reliability coefficient indicates
what percentage of variance in the scale scores can be considered "true"
variance. For example, if the reliability of a scale is found to be .90, we can
infer that 90% of the total variance in the scale scores is true.

Nunnally's development of measurement error (1978) clearly shows the
importance of employing reliable instruments in any research project. Un-
fortunately, some researchers do not take the issue of reliability seriously
enough and later find the results of their study to be confusing and
disappointing.
Sources of Error


The previous section described the theory underlying reliability. It became 
clear that the vehicle for studying reliability was the estimation of the 
amount of error in the measurement. In this section we will explore the 
possible sources of error in the context of selecting an appropriate procedure 
for establishing reliability. 

We know that the more reliable an instrument, the less the error in the 
measurement. But where do errors come from? Actually there are several 
different sources of error depending on the nature of the test and how it is 
employed. According to Nunnally (1978), the major source of error within 
an instrument is due to inadequate sampling of items. This is consistent with 
our domain-sampling approach for establishing content validity. Given an 
adequate sampling of items from the domain, each individual theoretically 
has a given probability of agreeing with each item, and this probability can 
be generalized to establish the expected number of agreements within the 
particular sample of items. The source of error, then, comes from inade- 
quate sampling of items from the domain, which results in low probability 
agreement tendencies. This situation is directly reflected in the average 
interitem correlation, which would tend to be lower when inadequate samp- 
ling is present. In the next section we will discuss the reliability estimation 
procedure (i.e., Cronbach's alpha) appropriate for addressing this source of 
error. 

Several other sources of error can be labeled situational factors. For 
instruments measuring affective characteristics the most common sources as 
described by Isaac and Michael (1981, p. 126) are: (a) individual response 
variation due to such factors as fatigue, mood, or motivation, which could 
lead to random or careless responses; (b) variation in administration proce- 
dures (i.e., across time for two administrations or across classrooms during 
one administration) to include physical factors such as temperature and 
noise and psychological factors such as unclear instructions or lack of test 
time pressures; and, (c) errors in hand or computer scoring of the responses 
(e.g., failure to properly address missing data, which leads to zero scores on 
a 1-5 Likert scale). 
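
The scoring error mentioned in (c) can be sketched in a few lines of Python. This is only an illustration: the function name `scale_score`, the use of `None` to mark a blank response, and the decision to average the answered items are assumptions made here, not a procedure taken from the text.

```python
def scale_score(responses, low=1, high=5):
    """Mean of the valid responses on a low..high Likert scale.

    Missing responses are marked with None (an assumption of this sketch)
    and are skipped rather than being counted as zeros.
    """
    valid = [r for r in responses if r is not None]
    if not valid:
        return None  # nothing to score
    for r in valid:
        if not low <= r <= high:
            raise ValueError(f"response {r} is outside the {low}-{high} range")
    return sum(valid) / len(valid)


answers = [4, 5, None, 4]  # hypothetical student; third item left blank

# Incorrect: the blank is silently scored as 0, which is not a legal response.
wrong = sum(r or 0 for r in answers) / len(answers)

# Safer: score only the items that were actually answered.
right = scale_score(answers)

print(wrong, right)
```

In this hypothetical case the zero-filled score understates the student's attitude, which is exactly the kind of scoring error that adds error variance to the scale scores.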

Administration of the same instrument on two different occasions is 
another source of error. Without some intervention we expect that most 
affective characteristics are fairly stable across a time period of, say, approx- 
imately three weeks. If we administer the same instrument to a sample of 
individuals and find that their scores are not stable, we could conclude that 
the variability in individual responses over time, as well as the situational 
factors described earlier, has contributed to an unreliable instrument. 

It should be clear, though, that reliability, like validity, is a generic term 
which refers to different research questions and types of evidence. When 
reporting reliability data, researchers must clearly state the type of relia- 
bility addressed since it points to the sources of error being studied. We will 
now turn to the two major types of reliability evidence: internal consistency 
and stability. 


Types of Reliability Coefficients 
Internal Consistency 


The alpha internal-consistency reliability coefficient is deduced directly 
from the domain-sampling theory of measurement error (Nunnally, 1978). 
As such, it addresses the important source of error due to the sampling of 
items, as well as the situational factors described earlier where a single 
administration of the instrument is used. The formula is so important that 
Nunnally states: “It is so pregnant with meaning that it should routinely be 
applied to all new tests” (p. 214). The equation reads as follows: 


$$\text{rel} = \frac{k}{k-1}\left(1 - \frac{\sum \sigma_i^{2}}{\sigma_y^{2}}\right) \tag{5.8}$$


where k = the number of items, $\sum \sigma_i^{2}$ = the sum of the item variances, and 
$\sigma_y^{2}$ = the variance of the total (scale) scores. Reference to formula (5.6) also 
suggests that the square root of coefficient alpha is the estimated correlation of 
the scale scores with errorless true scores. Nunnally (1978) also notes that 
alpha represents the expected correlation of one instrument with an 
alternative form containing the same number of items, which further clarifies the 
association of alpha with the domain-sampling theory of measurement 
error. 
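
As a rough illustration of equation 5.8, the sketch below computes alpha from the item variances and the variance of the total scores. The function name `cronbach_alpha` and the small response table are hypothetical and are used only to show the arithmetic.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Coefficient alpha (equation 5.8) from a respondents-by-items table."""
    k = len(rows[0])                                  # number of items
    items = list(zip(*rows))                          # one tuple per item
    sum_item_var = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in rows])  # variance of scale scores
    return (k / (k - 1)) * (1 - sum_item_var / total_var)


# Hypothetical responses: 5 people x 3 items on a 1-5 scale.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 2],
    [3, 4, 3],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2))          # reliability estimate
print(round(alpha ** 0.5, 2))   # estimated correlation with true scores (eq. 5.6)
```

The same ratio results whether sample (n - 1) or population (n) variances are used, as long as the choice is the same for the items and the total.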

The ideal way to generate alpha reliabilities is to use a computer, which 
will also generate several other item and scale level statistics. Readers 
should be aware, however, that alpha can be estimated quite quickly using 
a hand calculator. A simple example will be used to illustrate this point, and 
to further develop the concept of internal consistency of individuals' 
responses. 

Consider the following four attitude toward school items taken from a 
larger set of items: 


I like school. 

School is really fun. 

I enjoy going to school. 
School makes me happy. 
Table 5-1. Interitem Correlations for Attitude Toward School Items 

                     Items
Items       1       2       3       4
  1         --
  2        .40      --
  3        .50     .55      --
  4        .55     .40     .40      --


Assume that 200 grade 6 students responded to these four items on a 5-point 
Likert scale ranging from strongly disagree (1) to strongly agree (5). Rather 
than generating the applicable variances specified in equation 5.8, the alpha 
reliability can be estimated by using the interitem correlations (see Cron- 
bach, 1951) calculated when the factor analysis was run to examine construct 
validity. Table 5-1 contains hypothetical interitem correlations. The alpha 
reliability coefficient is generated in two steps. First, calculate the average 
interitem correlation, which is .47 for our example. This average interitem 
correlation is important in that low levels reflect both high amounts of error 
in the sampling of items from the domain of content and the possible exis- 
tence of situational factors that would influence the responses to the items. 
It is important to note that, in essence, calculating the average of the 
correlations estimates the reliability of a one-item scale. Therefore, it is 
necessary to estimate the reliability of the four-item scale by using the 
general form of the Spearman-Brown Prophecy Formula, 

$$\text{rel} = \frac{K\bar{r}}{1 + (K-1)\bar{r}} \tag{5.9}$$


where K represents the number of times one wishes to increase the length of 
the instrument and $\bar{r}$ represents the average interitem correlation. Inserting 
K = 4 (i.e., the original number of items) and $\bar{r}$ = .47 into the formula for 
our example yields an alpha of .78. Based upon this reliability coefficient we 
can then say that 78% of the variance in the scale scores (i.e., sum of four 
items) can be considered true variance, and that the estimated correlation of 
the scale scores with errorless true scores is the square root of .78 or .88 (see 
equation 5.6). 
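
The two-step hand calculation can be checked with a short sketch. The six correlations are the hypothetical values from table 5-1; the variable names are illustrative only.

```python
from math import sqrt

# Interitem correlations from table 5-1 (one value per pair of items).
correlations = [0.40, 0.50, 0.55, 0.55, 0.40, 0.40]
k = 4  # original number of items

# Step 1: the average interitem correlation (reliability of a one-item scale).
r_bar = sum(correlations) / len(correlations)

# Step 2: step the value up to a four-item scale with equation 5.9.
alpha = k * r_bar / (1 + (k - 1) * r_bar)

print(round(r_bar, 2), round(alpha, 2), round(sqrt(alpha), 2))
```

Carrying the unrounded average through the formula, rather than the rounded .47, still gives the .78 and .88 reported in the text.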


Items with negative stems should be reverse scored prior to generating the 
correlations. Failure to do this will result in negative correlations that 
appear to lower the alpha reliability. Reverse scoring in the beginning of the 
analysis will alleviate the need to keep track of which items are negative 
stems. If you have not reverse scored the items, simply ignore the negative 
signs when averaging the correlations. If you did reverse score the negative 
items and still get negative correlations, you will get and, indeed, deserve to 
get a low alpha. 
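
Reverse scoring on a 1-5 response format amounts to subtracting each response from 6. The sketch below is illustrative: the helper name `reverse_score`, the sample responses, and the choice of which item has a negative stem are all assumptions.

```python
def reverse_score(response, low=1, high=5):
    """Reverse a Likert response so that 5 -> 1, 4 -> 2, ..., 1 -> 5."""
    return low + high - response


# Hypothetical responses of one student to four items; the item at
# index 2 is assumed to have a negative stem.
responses = [5, 4, 1, 5]
negative_items = {2}

scored = [
    reverse_score(r) if i in negative_items else r
    for i, r in enumerate(responses)
]
print(scored)
```

Doing this once, before any correlations are generated, means every later statistic is computed on items that all point in the same direction.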

Also check to see that you have recorded the proper number of interitem 
correlations if you are taking the values from a correlation matrix for a larger 
number of items. A handy check is to calculate the number of correlations 
you need as the “number of combinations of K (items) things taken two at a 
time" as follows: 


$$\frac{K!}{2!\,(K-2)!} \tag{5.10}$$

where K represents the number of items and 2 indicates you are correlating 
two items at a time.¹ For our four-item example, the number of combinations 
is figured to be 

$$\frac{4!}{2!\,2!} = 6$$

For a six-item scale the number of combinations grows quickly to be 

$$\frac{6!}{2!\,4!} = 15$$

Again, recall that figuring the number of correlations you need to average 
can be done with equation 5.10, but the K in equation 5.9 represents the 
original number of items in your scale. Using the value of K from equation 
5.10 in 5.9 will result in a greatly inflated alpha coefficient. In the short run 
you will be quite pleased, but soon reality will set in and the proper alpha 
value will be lower. During the pilot testing of a new instrument, this type of 
error could be quite serious. 
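
A brief sketch of the check in equation 5.10, together with the inflation that results from putting the wrong K into equation 5.9. The function names are illustrative; the values K = 4 and an average correlation of .47 come from the four-item example.

```python
from math import comb


def n_correlations(k):
    """Number of distinct interitem correlations among k items (equation 5.10)."""
    return comb(k, 2)


def alpha_from_r_bar(r_bar, k):
    """Equation 5.9: alpha for k items with average interitem correlation r_bar."""
    return k * r_bar / (1 + (k - 1) * r_bar)


k = 4
r_bar = 0.47

correct = alpha_from_r_bar(r_bar, k)                   # K = number of items
inflated = alpha_from_r_bar(r_bar, n_correlations(k))  # K = 6, the mistake

print(n_correlations(4), n_correlations(6))
print(round(correct, 2), round(inflated, 2))
```

The gap between the two values widens as the number of items grows, because the number of correlations increases much faster than the number of items.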

In this section we have illustrated the procedure suggested by Cronbach 
(1951) to estimate an alpha reliability based upon interitem correlations. 
A final example using the four attitude toward school items listed earlier 
will further clarify the concept of internal consistency. That is, just what 
is it about individual response patterns that leads to high alpha internal 
consistency reliabilities? To answer this question consider the two small 
hypothetical data sets contained in table 5-2. Example 1 illustrates 
internally consistent responses across the four items defining the scale. Note that 
some of the individuals tend to consistently "agree" with the items (i.e., like 
school), while other students appear not to like school (they consistently 
"disagree" with the items). Picture two hypothetical students who really do 
Table 5-2. Response Patterns Leading to High and Low Alpha Reliabilities (a) 

                    Example 1             Example 2
                      Items                 Items
Individuals      1    2    3    4      1    2    3    4
     A           5    4    4    5      2    2    1    2
     B           2    1    2    3      4    1    4    1
     C           1    1    2    1      5    1    4    3
     D           4    5    5    4      2    4    3   [?]
     E           5    5    5    5      2    4    1   [?]
     F           5    3    4    5      4    5    4    5
     G           2    3    1    2      2    1    5    2
     H           3    2    4    1      1    2    5    2
     I           5    2    2    5      4    3    3   [?]
     J           2    5    3    4      2    2    5   [?]

(a) Item stems: 1. I like school. 
                2. School is really fun. 
                3. I enjoy going to school. 
                4. School makes me happy. 
    Response format: 5 = Strongly agree 
                     4 = Agree 
                     3 = Undecided 
                     2 = Disagree 
                     1 = Strongly disagree 


like school (e.g., individuals A and D). When these individuals processed 
these four item stems, they perceived content similarities so that they 
responded in a similar and consistent manner by tending to “agree” with the 
statements. Readers will recall that the items represent operational 
definitions for a concept called "attitude toward school." To the extent that we 
observe internally consiste 


evidence that can be report 
Example 2 illustrates response 


inconsistent responses. Fo 
the item “I like School," 
really fun.” 


alpha internal consistency reliability is 
1951) technique. 

Table 5-3. Interitem Correlations and Estimated Alpha Reliabilities 

                        Example 1                    Example 2
                          Item                         Item
                    1     2     3     4          1     2     3     4
Interitem      1    --   .43   [?]   .81         --   [?]   .06  -.12
Correlations   2          --   [?]   [?]               --  -.39   .30
               3                --   [?]                     --   [?]
               4                      --                           --

Average
interitem           mean r = .58                 mean r = .02
correlation

Estimated           rel = 4(.58) / [1 + 3(.58)]  rel = 4(.02) / [1 + 3(.02)]
Alpha                   = .85                        = .08
Reliabilities

Note that the average interitem correlation ($\bar{r}$ = .58) is higher for 
example 1, which the domain-sampling model attributes to a more adequate 
sampling of items from the “attitude toward school” domain. In example 2 we 
noted that respondents perceived different meanings for the item stems than 
were anticipated during the content validation stage. Thus, the 
domain-sampling model suggests that these items are not representative or well 
sampled from the domain and therefore result in a lower average interitem 
correlation ($\bar{r}$ = .02). The effect of this situation on the alpha 
reliabilities is evident as example 1 has an alpha of .85 and example 2, of .08. 
If the items and data from example 2 were part of a pilot test of a new 
instrument, we could only say that 8% of the observed variance in attitudes 
toward school can be considered true variance and that the estimated 
correlation between scale scores formed from these items and true scores 
is .28 (see equation 5.5). Example 2, then, sends us back to step one where we 
reanalyze the operational definitions of the content domain and conduct a new 
content validity study. 
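
The Example 1 figures can be reproduced from the response patterns themselves. The sketch below uses the Example 1 responses as read from table 5-2 (taking the response of individual I to item 4 as 5); the helper `pearson` and the variable names are illustrative, and the alpha is the correlation-based estimate from equation 5.9.

```python
from itertools import combinations
from math import sqrt

# Example 1 responses from table 5-2 (rows = individuals A-J, columns = items 1-4).
example_1 = [
    [5, 4, 4, 5], [2, 1, 2, 3], [1, 1, 2, 1], [4, 5, 5, 4], [5, 5, 5, 5],
    [5, 3, 4, 5], [2, 3, 1, 2], [3, 2, 4, 1], [5, 2, 2, 5], [2, 5, 3, 4],
]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


items = list(zip(*example_1))  # one tuple of responses per item
k = len(items)
rs = [pearson(items[i], items[j]) for i, j in combinations(range(k), 2)]

r_bar = sum(rs) / len(rs)                  # average interitem correlation
alpha = k * r_bar / (1 + (k - 1) * r_bar)  # equation 5.9 with K = 4

print(round(r_bar, 2), round(alpha, 2))
```

Run on these ten response patterns, the sketch agrees with the average correlation of .58 and the alpha of .85 discussed above.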
The split-half technique has also been used to examine internal-consistency 
reliability. In this procedure the instrument or a scale on the instrument is 
randomly split into two equivalent sets of items which represent two samples 
of items from the content domain. The correlation of the scores from the two 
halves is then entered into a special form of the Spearman-Brown Prophecy 
Formula, which follows from equation 5.9, to generate the reliability of the 
whole instrument as follows: 

$$\text{rel} = \frac{2r_{12}}{1 + r_{12}} \tag{5.11}$$
where 2 = the factor that indicates that the instrument is really twice as long 
and $r_{12}$ = the correlation between the two half-instruments. 

The major problem with this technique is its dependence on obtaining a 
proper split of the whole instrument on the basis of item content. Given that 
computer programs readily produce Cronbach's alpha, which represents the 
average of all possible splits of the instrument, it is rare that one would 
want to use the split-half technique. 
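
For completeness, the step-up in equation 5.11 is a one-line calculation. The function name and the half-test correlation of .60 are hypothetical values chosen for illustration.

```python
def split_half_reliability(r_halves):
    """Equation 5.11: step a half-instrument correlation up to full length."""
    return 2 * r_halves / (1 + r_halves)


# Hypothetical correlation between the scores on the two half-instruments.
r_12 = 0.60
full_length = split_half_reliability(r_12)
print(round(full_length, 2))
```

The result depends on which items fall in each half, which is the weakness of the technique noted above.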


Stability 

[...] responses over time. If one is using an affective [...] post program 
evaluation model, we would like to assume that the differences between the 
pre and post scores are due to a [...] stability reliability in the instrument. 
[...] say, a three-week period in [...] change the trait. We then administer 
[...] to [...] sample and correlate the test-retest scores [...] stability. 

[...] considerations for planning a stability reliability study. First, a high 
stability reliability does not address the item [...] domain-sampling model 
(see Nunnally, 1978, ch. [...]). [...] an inadequate sample of items could 
result in an average interitem [...] yield a very low alpha 
internal-consistency reliability. These same items, though, could be found to 
have a very high stability reliability since [...] always first establish the 
alpha [...] stability reliability coefficient. An instrument with a low alpha 
[...] high stability reliability should not be [...] the test and retest is too 
short (i.e., a few days), respondents [...] repeat their recalled responses. 

Third, we should note that [...] with extreme ratings [...], or attempts to 
fake or give socially desired responses, can yield stable response patterns. 
Finally, we [...] graduate students often report any form of reliability 
located [...] test manual without carefully thinking through and defending 
[...] reliability evidence needed for their particular use of [...] 
Another common error is to report reliability data for samples that have 
little in common (e.g., grade level) with the samples to be employed in the 
proposed research. 

It should be clear that reliability is a generic term, which necessitates 
different methods (i.e., internal consistency and stability) of generating 
reliability evidence to account for different sources of error. Clearly 
designed and labeled reliability evidence should be present prior to using any 
instrument. 


Factors Affecting Reliability 


The reliability of a set of items is affected by several factors: the character- 
istics of the sample, the homogeneity of the item content, the number of 
items, and the response format. It is essential that instrument developers 
understand how these areas potentially affect the reliability of a set of items. 
In this section we will discuss each area noting that they are not independent 
but most likely interact with each other to affect especially the 
internal-consistency reliability level. 


Sample Characteristics 


In pilot testing a set of items, the selection of the sample is crucial. The goal 
is to select a sample that exhibits the same level of variability in the affective 
characteristic as that existing in the target population. For example, if a set 
of attitude toward secondary school items is to be administered, the sample 
of high school students should reflect the entire high school population. It 
would be an error to administer the items to four easily available grade 9-12 
honors classes. Since these students would most likely exhibit generally 
positive attitudes toward school, the variance in their responses would be 
less than that for the total high school population. It is a well-known 
statistical fact that lowering the level of variance in either or both variables 
involved in a correlation necessarily reduces the size of the correlation. Thus, 
two particular items on the scale may actually correlate quite well, but the 
sample characteristics have in effect put a ceiling on the level of correlation. As a 
result, the average interitem correlation for the set of items will be low, as 
will the alpha internal-consistency reliability. The domain-sampling model 
will then lead the developer to conclude erroneously that the sampling of 
items from the universe of items is inadequate, when in fact the sampling of 
items may have been quite adequate, but the sampling of people inadequate. 
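
The ceiling effect described above can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: the latent-attitude model, the error variances, the cutoff that defines the "honors" subsample, and the random seed.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


random.seed(1)

# Hypothetical model: each student has a latent attitude, and two items
# each reflect that attitude plus independent random error.
students = []
for _ in range(4000):
    attitude = random.gauss(0, 1)
    item_a = attitude + random.gauss(0, 1)
    item_b = attitude + random.gauss(0, 1)
    students.append((attitude, item_a, item_b))

# "Honors classes": only students with clearly positive attitudes.
honors = [s for s in students if s[0] > 1.0]

r_full = pearson([s[1] for s in students], [s[2] for s in students])
r_honors = pearson([s[1] for s in honors], [s[2] for s in honors])

print(round(r_full, 2), round(r_honors, 2))
```

Under these assumptions the two items are related in exactly the same way for every student, yet the correlation in the restricted subsample comes out well below the correlation in the full group.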

Figure 5-1. Average interitem correlations and alpha reliabilities estimated by 
increasing the number of items by a factor of K. (Horizontal axis: K, the 
factor for increasing the number of items.) 

[...] items with a high average interitem correlation will have a higher alpha 
reliability than a lot of items with a low interitem correlation. To illustrate 
this point, consider again equation 5.9. 

$$\text{rel} = \frac{K\bar{r}}{1 + (K-1)\bar{r}}$$


where K = the number of items on the scale or the number of times you wish 
to increase the number of items and $\bar{r}$ = the average interitem correlation. 
Figure 5-1 contains reliability information generated using equation 5.9 for 
selected scales from Super's (1970) Work Values Inventory (WVI). The 
vertical axis indicates the average interitem correlations for five of the WVI 
3-item scales based upon a sample of 200 high school sophomores (Gable, 
1969). The horizontal axis contains the various values of K. Note that for the 
value K = 1 the average interitem correlations are indicated. By inserting 
the average correlations into equation 5.9 and incrementing K progressively 
we see how the estimated alpha reliability value increases as the number of 
items is increased. Since the original WVI instrument contains three items 
per scale, the plotted values at K = 3 represent the alpha reliabilities for the 
WVI and the values for K > 3 represent estimated alpha reliabilities if the 
number of items on the scales were to be increased in future editions of the 
instrument. 
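
The kind of curves plotted in figure 5-1 can be approximated by tabulating equation 5.9 over several values of K. In this sketch the three average interitem correlations are hypothetical round numbers, not the WVI values, and the function name is illustrative.

```python
def projected_alpha(r_bar, k):
    """Equation 5.9: alpha for a scale of k items with average correlation r_bar."""
    return k * r_bar / (1 + (k - 1) * r_bar)


# Hypothetical average interitem correlations (not taken from the WVI).
for r_bar in (0.20, 0.40, 0.60):
    curve = [round(projected_alpha(r_bar, k), 2) for k in (1, 3, 6, 9, 12)]
    print(r_bar, curve)
```

At K = 1 each curve starts at its average interitem correlation, and a 3-item scale with an average correlation of .60 already exceeds a 9-item scale with an average correlation of .20.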

Returning to our original point, it is clear that a few items with a high 
average r can have a higher alpha reliability than a lot of items with a smaller 


Suppose that we have a new instrument (A) and a known instrument or a 
criterion measure (B). If the reliability of both instruments is .81, the 
maximum validity possible is .81. Consider the situation where the reliability of 
both instruments is as low as .49 and the maximum validity is, therefore, .49. 
It is not uncommon for researchers to conclude that a new instrument is not 
valid on the basis of a lower than theoretically expected correlation with a 
known measure, when the real problem was a lack of reliability. That is, the 
theory was great, but so was the amount of error variance in the data from 
each instrument. 
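
The ceiling is the square root of the product of the two reliabilities, so the two cases above can be verified directly. The function name is illustrative, and the third call is a hypothetical mixed case added to show what happens when only the criterion is unreliable.

```python
from math import sqrt


def max_validity(rel_test, rel_criterion):
    """Upper bound on the validity coefficient given the two reliabilities."""
    return sqrt(rel_test * rel_criterion)


print(round(max_validity(0.81, 0.81), 2))  # both reliabilities .81
print(round(max_validity(0.49, 0.49), 2))  # both reliabilities .49
print(round(max_validity(0.81, 0.49), 2))  # hypothetical mixed case
```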

In this section we have discussed the relationship between reliability and 
validity. Reliability is always important, but validity remains as the ultimate 
question. Researchers should be aware that some instrument manuals pre- 
sent the more easily obtained evidence of reliability and then try and argue 
that validity follows. Meaningful validity evidence may be somewhat more 
difficult to obtain, but must be present in all technical manuals before users 
can confidently make meaningful score interpretations. 


An Illustration 


In this chapter we have emphasized the importance of the alpha 
internal-consistency reliability coefficient and noted that its square root 
represented the estimated correlation of the score with errorless true scores. 
Recall that Nunnally (1978) even stated that coefficient alpha “is so pregnant 
with meaning that it should routinely be applied to all new tests" (p. 214). In 
light of its importance, we have attempted to develop an understanding of the 
concept of internal consistency by illustrating how responses were internally 
consistent and how to estimate alpha using interitem correlations. In the 
process of developing a new instrument or studying an existing one much 
more statistical information is needed in addition to the actual alpha 
coefficient. Basically, this information reflects item-level statistics which 
contribute to the level of alpha. 

In this section we will illustrate how the excellent SPSS Reliability program 
(SPSS Update 7-9 or SPSSX) can be used to generate the necessary 
item analysis and reliability information (SPSS Inc., 1983). For the 
illustration we will use the SPSSX program for the Gable-Roberts Attitude Toward 
studies” measured by an instrument with a reliability of .50. Employing a 
t-test the researcher found no differences between the groups. It could well 
be that 50% of the variance was error variance, which greatly increased the 
estimated error in the sampling distribution of mean differences (i.e., the 
denominator of the t-test). This would have concealed the large treatment 
effect that was really present.² Thus, the use of unreliable affective 
instruments as dependent variables in a program evaluation study could readily 
result in what statisticians call a “Type II error,” that is, failure to reject a 
false null hypothesis (i.e., the program really worked and you say it did not). 
Consider also a similar problem encountered in the use of affective measures in 
a regression analysis. The purpose of the analysis is to explain variation in the 
dependent variable using a set of independent variables. We often forget 
that if the dependent variable is not highly reliable, we are trying to explain 
the unexplainable. Unreliable predictors compound the problem even 
further. The result will be a small multiple correlation and a frustrated 
researcher who concludes that the independent variables were not well 
selected. It could be that the variables were theoretically sound but not 
accurately measured. 


An even more serious problem develops when unreliable affective measures 
are used to make decisions about individuals. Recall that in chapter 4 
we discussed the criterion-related (predictive) validity of the McCook 
Alternative Programming Scale (McCook, 1973). This attitude scale was 
designed to estimate the probability that future students could be expected 
to drop out of school. Using discriminant-function analysis during the 
development of the instrument, a regression equation was developed which 
was found to be quite accurate in classifying [...] being identified for needed 
services. 

The point is that no research will be successful when the measurement of 
the variables is not reliable. We cannot measure affective variables with 
precise 12-inch rulers. Before we proceed into any research project, we must 
be certain that the scores will contain only small levels of error variance or 
our inferences from the operational to the conceptual definitions of the 
affective characteristics will be inaccurate and misleading. 
The Relationship of Reliability to Validity 


In this chapter we have described reliability as an indication of the 
proportion of variation in test scores which could be considered true variance as 
opposed to error variance. Further, we noted that evidence of the reliability 
of a measure could be in the form of the internal consistency of responses 
upon one testing or the stability of scores across time. But is a reliable 
instrument always a valid instrument? The answer is clearly no! 

It is commonly stated that “reliability is a necessary but not a sufficient 
condition for validity.” By this statement we mean that it is clearly possible 
for an instrument to have high internal consistency and stability reliability 
but low validity. For example, we could administer a set of 20 items to a 
sample of students and find that the internal consistency and stability (upon 
a retest) reliability of the responses were quite high. The validity of the 
instrument depends upon what we claim the items measure. If we claim that 
the 20 items reflect attitudes toward school, can we offer validity evidence to 
support this claim? Assume that you look at the items and conclude that the 
item content reflects self-concept and has little reference to school 
situations. You are, in fact, questioning the content and construct validity of the 
items. It would be appropriate to examine the construct validity in the light 
of correlations with other known instruments or possibly a factor analysis. 
Note, though, that the factor analysis could only be as clear as the item 
content would allow. 
The point is that a reliable instrument may or may not be valid. But a valid 
instrument is usually reliable. Why is this generally the case? Well, if an 
instrument is carefully developed so that (1) clear judgmental evidence 
exists to support the correspondence of the operational and conceptual 
definitions and (2) the empirical evidence based upon actual response data 
correlations with other known measures or factor analysis is supportive, the 
instrument should be reliable. If the instrument contained a large portion 
of error variance such that the responses fluctuated erratically, the correla- 
tions with other known instruments would not have been high nor would 
the factors be meaningful during the construct validity studies. That is, 
meaningful correlations during the construct validity study should in almost 
all cases result only between items or variables that share reliable variation. 
Thus, evidence of reliability is important in instrument development—it is, 
again, a necessary but not a sufficient condition for validity. 

Finally, consider how reliability sets a ceiling on the magnitude of the 
validity coefficient: 

$$\text{maximum validity} = \sqrt{\text{rel}(\text{test A}) \times \text{rel}(\text{crit. B})}$$
average r. For example, for the Altruism scale $\bar{r}$ = .66 (K = 1), and the 
3-item scale has an alpha of .83; for the Independence scale $\bar{r}$ = .21, the 
3-item scale has an alpha of .43, and the estimated alpha for a 12-item scale is 
.73. Since the WVI contains 15 scales at 3 items each, it would be difficult to 
revise all of the scales to have adequate reliabilities since half of the scales 
had average interitem correlations below .40 and would need a total of about 
8 items per scale (i.e., 5 new items) to result in scales with alphas above .80. 
Typically, good affective measures have average interitem correlations in 
the .30-.40 range. Therefore, it usually takes 8-10 items to generate alpha 
reliability levels in the .80 vicinity. The result of such revisions would clearly 
lead to an instrument with too many items for an appropriate time period for 
administration. See Gable and Pruzek (1971) and Gable (1969) for a 
discussion of how factor analysis and scale revisions could address this problem. 
In addition to equation 5.9, the following equation can also be used to 
estimate the alpha reliability of a scale if the number of items were increased 
by a factor of K. 

$$\text{rel}_{K} = \frac{K\,\text{rel}}{1 + (K-1)\,\text{rel}} \tag{5.12}$$

Here rel is the reliability of the existing scale and $\text{rel}_{K}$ is the estimated 
reliability of the lengthened scale. Actually, 5.9 and 5.12 are the same in that 
the average interitem correlation in 5.9 is the reliability, in this case for a 
1-item scale. To illustrate equation 5.12, assume we have a 10-item scale with 
an alpha of .60 and 10 new items are added from the same domain. Since the 
20-item scale is now twice as long as the original scale, the estimated 
reliability (K = 2) for the new scale would be .75. 
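
A short sketch of equation 5.12, using the 10-item example with an alpha of .60. The function name is illustrative, and the value .37 in the second part is a hypothetical one-item reliability used only to show that lengthening in two stages agrees with lengthening in one.

```python
def lengthened_reliability(rel, factor):
    """Equation 5.12: reliability after lengthening a scale by `factor`."""
    return factor * rel / (1 + (factor - 1) * rel)


# A 10-item scale with alpha .60 is doubled to 20 items (K = 2).
doubled = lengthened_reliability(0.60, 2)
print(round(doubled, 2))

# Hypothetical check that 5.9 and 5.12 agree: going from 1 item to 3 items
# and then tripling again matches going from 1 item to 9 items directly.
three_items = lengthened_reliability(0.37, 3)
nine_via_three = lengthened_reliability(three_items, 3)
nine_direct = lengthened_reliability(0.37, 9)
print(round(nine_via_three, 2), round(nine_direct, 2))
```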


A final equation is available to assist in determining the number of items 
needed on a scale to generate an adequate internal-consistency reliability 
level. 

$$K = \frac{\text{rel}_{DES}\,(1 - \text{rel}_{EX})}{\text{rel}_{EX}\,(1 - \text{rel}_{DES})} \tag{5.13}$$

where $\text{rel}_{DES}$ = desired reliability, 
      $\text{rel}_{EX}$ = existing level of reliability, and 
      K = number of times the scale needs to be increased to yield 
          $\text{rel}_{DES}$. 

This formula is useful in that it allows one to calculate directly the number of 
items needed instead of inserting various values of K into 5.9 or 5.12 on a 
trial-and-error basis. For example, if we have a 3-item scale with a reliability 
of .64, we would calculate the factor of K to reach a level of .80 in the 
following manner. 

$$K = \frac{.80\,(1 - .64)}{.64\,(1 - .80)} = \frac{.288}{.128} = 2.25$$


A common error in using this formula is to conclude that K represents the 
number of items to add to the scale. Note carefully that K represents the 
factor by which the original number of items should be multiplied to 
lengthen the original scale. In our example, the original 3-item Esthetics 
scale should be increased by a factor of 2.25. To be safe we round the factor 
upward and conclude that we need 3 x 3 or a total of 9 items (i.e., 6 new 
items). Reference to Figure 5-1 indicates that the 9-item scale would have 
an estimated alpha above .80. Once again we see, as Nunnally (1978) points 
out, that scales with low average interitem correlations (e.g., WVI Esthetics 
in figure 5-1, $\bar{r}$ = .37) will need several items to reach an acceptable 
reliability level. 
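
The Esthetics example can be written out as a small sketch of equation 5.13. The function and variable names are illustrative; the figures (3 items, existing reliability .64, desired reliability .80) are those used in the text.

```python
from math import ceil


def lengthening_factor(rel_existing, rel_desired):
    """Equation 5.13: factor K by which the scale must be lengthened."""
    return rel_desired * (1 - rel_existing) / (rel_existing * (1 - rel_desired))


n_items = 3
k = lengthening_factor(0.64, 0.80)
total_items = n_items * ceil(k)    # round the factor upward, as in the text
new_items = total_items - n_items  # K is a multiplier, not the number to add

print(round(k, 2), total_items, new_items)
```

Keeping the multiplication and the subtraction as separate steps makes it harder to confuse the factor K with the number of items to be written.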

Before leaving this topic we should note that the items added to an 
existing scale to enhance reliability should clearly parallel the best items on 
the existing scale. The best items are those with the highest item/scale cor- 
relations. Simply look at the content of the item and write a parallel item. 


Acceptable Levels of Reliability 


The level of reliability considered to be acceptable for affective measures 
depends in part on the use for which the instrument is intended. Before 
levels appropriate for particular users are suggested, some overall com- 
ments can be made. In general, affective measures are found to have slightly 
lower reliability levels than do cognitive measures. Apparently this is the 
case because cognitive skills tend to be more consistent and stable than most 
affective characteristics. Thus, it is typical for good cognitive measures to 
have alpha and stability reliabilities in the high .80s or low .90s, where even 
good affective instruments frequently report reliabilities as low as .70. 

The difference in the reliability criterion level also reflects the nature of 
the decisions to be made based upon the obtained scores. Several crucial 
programming decisions (e.g. special education placement and college 
admissions) are often based upon the results of cognitive achievement mea- 
sures. On the other hand, many researchers feel that the data resulting from 
affective measures will be used for only descriptive purposes. While the 
criterion level for affective measures can be reasonably set at a minimum of 
.70, there are situations and considerations where higher levels may be 
necessary. Consider for example, a researcher who compared two methods 
of teaching social studies on the dependent variable “attitude toward social 
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А — : a ill 
Were generated and will be compared later in this section. First, we w 


of the reliability Output for the varimax solution, 


and the statistics desired. N 


4-6) were reverse Scored earlier in the program. 


» Presentation of Subject, are presented A 
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Correlations. 
should be targeted for review. The next column is labeled squared multiple
correlation and contains the squared correlation between the particular item
and a linear composite of the remaining items defining the scale. This value is
calculated using multiple regression where the target item is the criterion 
and the remaining items are the predictors. These data are not particularly 
useful since the values will for all practical purposes relate highly to the 
item-scale correlations in the previous column. The final column, alpha if 
item deleted, is extremely important as it indicates the level of alpha reliabil- 
ity present if the particular item is deleted from the scale. Finally, we note 
that below these columns two values of alpha are presented. The first value 
labeled “alpha” is the alpha reliability coefficient generated using equation 
5.8. The second alpha labeled “standardized item alpha” represents the 
estimate of alpha generated through the correlation technique as specified in 
equation 5.9. This technique standardizes each item by dividing it by its 
respective standard deviation found during the correlation process. The 
resulting two alpha values will be quite similar; most researchers report the 
alpha from equation 5.8. 
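
The two alpha values can be illustrated with a minimal sketch. Equations 5.8 and 5.9 are not reproduced in this section, so the code assumes they are the usual variance-based alpha and the alpha computed from the mean interitem correlation; the function names and the small response matrix are hypothetical and are not the author's data.

```python
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(items):
    """Variance-based alpha (assumed form of equation 5.8).

    items: list of k lists, one per item, each holding n respondent scores.
    """
    k = len(items)
    totals = [sum(person) for person in zip(*items)]  # scale score per person
    item_variance_sum = sum(variance(scores) for scores in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


def standardized_item_alpha(items):
    """Alpha from the mean interitem correlation (assumed form of equation 5.9)."""
    k = len(items)
    correlations = [
        pearson_r(items[i], items[j]) for i in range(k) for j in range(i + 1, k)
    ]
    r_bar = mean(correlations)
    return (k * r_bar) / (1 + (k - 1) * r_bar)


# Hypothetical responses: 4 items (rows) answered by 6 people on a 5-point scale.
responses = [
    [1, 2, 3, 4, 5, 4],
    [2, 2, 3, 5, 5, 4],
    [1, 3, 3, 4, 4, 5],
    [2, 1, 4, 4, 5, 3],
]
print(round(cronbach_alpha(responses), 2),
      round(standardized_item_alpha(responses), 2))
```

When the item standard deviations are similar, as they usually are for items sharing one response format, the two functions return nearly the same value, which is the behavior the text describes.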

Now that we have discussed the components of the reliability output, we 
can suggest a strategy for using the output. As with many canned computer 
programs, although one has much data available, only certain information 
will be considered essential. The suggested procedure is as follows: 


1. Check the overall alpha reliability (p. 13) noting that you are looking for 
at least a .70 but would be most pleased with a value greater than .80. 
For these data we have a value of .88 on Factor I. 

2. Return to the means and standard deviations (p. 12) and look for 
relatively high or low means and associated low standard deviations. 
The data for Factor I look fine. 

3. Examine the item intercorrelations to ascertain if you have any low or 
negatively correlated items. For affective scales you are hoping for cor- 
relations in the .30-.50 range, since about eight of these items will yield 
a very adequate alpha.

4. For the five columns of item/scale statistics (p. 13) focus directly on two
columns: corrected item-total (scale) correlation and alpha if item de- 
leted. These two columns are loaded with crucial information regarding 
which items are not contributing to a high reliability level. In general, 
you will find that deleting items with the lower item/scale correlations 
will enhance the alpha level. This will be particularly true for scales with 
few items. If a scale has a large number of items (e.g., 15) the deletion of
an item with the lowest item-scale correlation may not alter the reliabil- 
ity at all since there are so many other items on the scale. For the data 
Table 5-5. Alpha Reliabilities for Attitude Toward Teacher:
Varimax and Oblique Rotations

                                Number of Items    Number of Items   Alpha Reliability
                                                   Common to Both
Factor                          Varimax  Oblique   Solutions         Varimax  Oblique

Presentation of Subject            8        8          8               .88      .88
Interest in Job and Students       7        6          6               .87      .86
Teaching Techniques                7        5          5               .84      .81


presented here for eight items defining Factor I, item 14 has the lowest
correlation with the remaining seven items on the scale. (Readers
should note that item 14 also exhibited the lowest loading on Factor I in
table 4-10.) Deleting the item does not increase the overall reliability of
.88 generated for the total of eight items. Thus, no changes should be
made in the items defining Factor I.
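
The two columns emphasized in step 4 can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the data are invented (item 5 is deliberately written to run against the other four), and the corrected item-total correlation is taken to be the correlation of an item with the sum of the remaining items.

```python
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(items):
    """Variance-based alpha for a list of items (each a list of scores)."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


def item_scale_statistics(items):
    """Corrected item-total correlation and alpha if item deleted, per item."""
    results = []
    for i, item in enumerate(items):
        rest = items[:i] + items[i + 1:]               # scale without item i
        rest_total = [sum(person) for person in zip(*rest)]
        results.append({
            "item": i,
            "corrected_item_total_r": pearson_r(item, rest_total),
            "alpha_if_deleted": cronbach_alpha(rest),
        })
    return results


# Hypothetical data: the last item is the mirror image of the first.
items = [
    [1, 2, 3, 4, 5, 4],
    [2, 2, 3, 5, 5, 4],
    [1, 3, 3, 4, 4, 5],
    [2, 1, 4, 4, 5, 3],
    [5, 4, 3, 2, 1, 2],
]
print("overall alpha:", round(cronbach_alpha(items), 2))
for row in item_scale_statistics(items):
    print(row["item"],
          round(row["corrected_item_total_r"], 2),
          round(row["alpha_if_deleted"], 2))
```

In this invented example the last item has a negative corrected correlation and alpha rises sharply when it is removed. For the Factor I data discussed above the opposite holds: removing item 14 does not raise alpha, so the item is retained.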


We have examined the reliability data for Factor I. Readers can proceed
to examine Factor II (Interest in Job and Students) and Factor III (Teaching
Techniques). What decisions would you make?

It should be clea

each factor, the number of items common to both solutions, and the alpha
reliabilities. It is clear that the varimax and oblique solutions revealed the 
same factor structures (see tables 4-10 and 4-11). As usual, the oblique 
solution tended to clean up the varimax solution slightly by deleting one item 
from Factor II and two items from Factor III which had the lowest loadings
on the varimax solution. Since we know that these items all had moderate
interitem correlations and that the number of items relates to the level of 
alpha, we expect the slight decrease in the level of alpha from the varimax to 
the oblique solution. In this data set the solutions are indeed comparable 
and the associated alpha levels are quite high; either solution could be 
chosen to describe the factor structure of the items. In other data sets the 
factor structures could be rather different. The outcome of inspecting the 
conceptual interpretation of the factors as well as the associated alpha 
reliabilities could clearly favor one of the solutions. 
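
The link between the number of items and alpha can be made concrete with a short sketch. It assumes that equation 5.9 is the usual expression of alpha in terms of the number of items and the mean interitem correlation, and it uses a mean correlation of .40 purely for illustration; the function name is hypothetical.

```python
def alpha_from_mean_r(k, r_bar):
    """Alpha implied by k items whose mean interitem correlation is r_bar."""
    return (k * r_bar) / (1 + (k - 1) * r_bar)


# Same assumed mean interitem correlation, fewer items on the scale.
for k in (8, 6, 5):
    print(k, round(alpha_from_mean_r(k, 0.40), 2))
```

With the mean correlation held at .40, dropping from eight items to six or five lowers alpha by only a few hundredths, which is the kind of slight decrease seen between the varimax and oblique solutions in table 5-5.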


Now It’s Your Turn 


In this section item analysis and reliability information will be presented for
a pilot version of a 40-item attitude survey administered to managers in a
large corporation (Keilty, Goldsmith, and Boone, 1983). Following a brief
description of the survey and the respondents, item analysis and reliability
data will be presented along with a series of questions to be considered by
the reader.


Description of the Instrument 


The Manager Concern Survey contains 40 situation statements which the
respondent rates on a Likert-type scale ranging from 1 (almost never) to 5
(almost always). Scores are obtained within each of the four basic styles of
leader behavior (S1: telling; S2: selling; S3: participating; S4: delegating) in
two domains of concern for people (human, 5 items) and concern for tasks
(task, 5 items). Responses to the 40 items are reported at the human- and
task-domain level within each of the four styles, yielding a total of eight
scores.


Sample 


Data were obtained from 88 Self ratings and 277 Other ratings by managers
during a series of professional seminars conducted for a large corporation.


Table 5-8. Interitem Correlations for Manager Concern Survey: Other S3-Task
Items

Items      9      14      20      28      31
  9       --
 14      .25      --
 20      417      LT      --
 28      .22     .08       E      --
 31     -.16    -.22    -.06    -.08      --


Item Analysis/Reliability 


Tables 5-6 and 5-7 present the item analysis and reliability data for the Self
and Other forms. The situations or items on the scale have been grouped
based upon their respective domain (i.e., human, task) within each of the
four styles (i.e., S1, S2, S3, S4) of leadership. This grouping aids the
analysis of the items and is consistent with the scoring scheme employed for
the scale. Presented in each table are the response percentages, means, stan-
dard deviations, correlation (r) with the domain, domain reliability if the
item is deleted, and the overall domain internal consistency reliability.


Research Questions 


Considering the data presented in table 5-6 for the Self data and table 5-7
for the Other data, how would you answer the following questions?

1. How would you describe the response patterns?

2. How do these response patterns affect the item means and standard
deviations?

3. How are the correlations with the domains (i.e., scales) calculated?

4. Consider the correlations of the items with the domain and the domain
alpha reliabilities if an item is deleted. What items would you target for
review?

5. (a) How do the correlations of items with the domain explain the do-
main alpha reliabilities?

(b) Explain why the S3 Task domain on the Other form has an alpha of
only .26 while the S4 Human domain has an alpha of .78 (see table
5-7).

6. Table 5-8 contains the interitem correlations for the S3-Task items on
the Other form. (Assume that appropriate items have already been
reverse scored where necessary.) Based upon these intercorrelations
and the resulting domain alpha reliability of .26, do/answer the fol-
lowing:

(a) Calculate the estimated alpha reliability using equation 5.9.

(b) On the basis of all the data presented, discuss how item 31 contri-
butes to the alpha level.

(c) How many additional items would be needed to raise the alpha level
to at least .80?

(d) Suggest a specific plan for revising the set of items to obtain an alpha
of .80.


Notes


1. The symbol ! indicates “factorial.” For example, 4! = 4 × 3 × 2 × 1.
2. To illustrate: consider the separate variance model where
$t = (\bar{X}_1 - \bar{X}_2) / \sqrt{S_1^2/N_1 + S_2^2/N_2}$.
Increased error variance could increase the values of $S^2$ and reduce the value of $t$.
6 A REVIEW OF THE STEPS
FOR DEVELOPING AN
AFFECTIVE INSTRUMENT


Overview 


In chapter 1, we discussed the theoretical basis underlying selected affective
characteristics. In chapter 2, the conceptual definitions were operational-
ized by developing belief statements or, in the case of the semantic differen-
tial, bipolar adjective pairs. Chapter 3 described the standard techniques for
scaling affective characteristics. Techniques described included Thurstone’s
equal-appearing interval, latent trait, Likert’s summated ratings, and
Osgood’s semantic differential. It was pointed out that all of the techniques
share the common goal of locating a person on a bipolar evaluative dimen-
sion with respect to a given target object. For each technique the scaling
process resulted in a single affective score arrived at on the basis of responses
to a set of belief statements. Similarities and differences among the scaling
techniques were presented. Finally, chapter 3 discussed the practical and
psychometric differences between ipsative and normative measures.
Chapters 4 and 5 discussed theory and techniques for examining the
validity and reliability of the affective instrument. Emphasis was first placed
upon examining the correspondence between the judgmental evidence col-
lected in the content validity study and the empirical evidence gathered to
examine construct validity. Finally, the importance of high alpha internal-
consistency reliability was discussed.


Table 6-1. Steps in Affective Instrument Development

Step   Activity                                                   Chapter

  1    Develop Conceptual Definitions                             1
  2    Develop Operational Definitions                            2
  3    Select a Scaling Technique                                 3
  4    Conduct a Judgmental Review of Items                       4
  5    Select a Response Format                                   3
  6    Develop Directions for Responding
  7    Prepare a Draft of the Instrument and Gather
       Preliminary Pilot Data
  8    Prepare the Final Instrument
  9    Gather Final Pilot Data
 10    Analyze Pilot Data                                         5
 11    Revise the Instrument
 12    Conduct a Final Pilot Study
 13    Produce the Instrument
 14    Conduct Additional Validity and Reliability Analyses       4, 5
 15    Prepare a Test Manual


t à conceptual 
Portant theo 
Step 2: Develop Operational Definitions (Chapter 2)


After careful consideration of the literature review and the conceptual
definitions, operational definitions are developed; these are the belief
statements to be used in the instrument. For the Equal-Appearing Interval
(see Thurstone, 1931a) and Latent Trait (see Wright and Masters, 1982)
techniques the developer attempts to develop statements that span the
favorable, neutral, and unfavorable points of the continuum underlying the
affective characteristic. On the other hand, the Likert (1932) Summated
Rating technique requires statements that can be easily judged to be either
favorable (i.e., positive) or unfavorable (i.e., negative) in direction. Neutral
statements do not fit the Likert technique. For Osgood’s Semantic Differen-
tial (Osgood et al., 1957) pairs of bipolar adjectives are selected to form the
extremes of the favorable or unfavorable continuum. Careful thought needs
to be given to selecting the bipolar adjectives from Osgood’s suggested
evaluative, potency, and activity dimensions. In most studies (e.g., program
evaluations) the semantic-differential scales will consist of evaluative
adjectives, with possibly a few adjectives from the potency and activity
dimensions included as anchors to clarify the interpretation of a later factor
analysis.


Step 3: Select a Scaling Technique (Chapter 3) 


It may seem a little early in the process, but the scaling technique needs to be
selected next. All of the techniques described in chapter 3 are appropriate,
with Likert’s procedure appearing to be the most popular at the current time
and the more psychometrically complex latent trait technique gaining
quickly in popularity. The selection of a technique will have implications for
how the remaining steps are conducted.


Step 4: Conduct a Judgmental Review of Items (Chapter 4) 

In step 4 the issue of content validity is addressed as the statements are
reviewed by content experts. For the Thurstone (1931a) Equal-Appearing
Interval procedure, two types of judgment are appropriate. First, the
judges rate the statements with respect to how much they relate to the
conceptual definition of the affective characteristic, keeping in mind the
reading level of the target group. If more than one scale is to be included in
the instrument, the judges should also rate the assignment of items to scale
categories. After any necessary revisions based upon this review are made,
the second judgment needed in the Thurstone technique is the
favorable or unfavorable rating by the judges for each item. This rating is
used to develop the scale value that locates the statement on the
continuum underlying the affective characteristic. Include several groups of
about 20 judges each to ascertain if the scale values are stable across differ-
ent types of judges. Finally, select items that are nonambiguous and equally
spaced along the response continuum using the Criterion of Ambiguity. (See
chapter 3 for a description of these procedures.)

For the latent-trait 


tain how well they relate to the concep 


For Osgood's Semantic Differential (Osgood et al., 1957) technique x 


Д jectives 
ce pertains to the selection of bipolar adjecti йй 
used to апсһог the scales hey relate to the concept to be rated. | 


Concept suggest that mostly evalu 
be used. 


with known word lists (see Da 
1953 for elementary grades). 


Step 5: Select a Response Format (Chapter 3)


techniques degrees of agreeing, importance, or frequency are recorded,
most often using a 5-point scale. Finally, for Osgood's Semantic Differential
the bipolar adjectives are listed at the ends of the response continuum for
each scale (item). The steps between the two adjectives are generally indi-
cated by unlabeled spaces.


Step 6: Develop Directions for Responding 


Respondents, especially young students, should never be confused by in-
complete or vague instructions. The procedures for responding to the state-
ments as well as the meaning of the anchor points on the continuum should
be carefully developed and reviewed by your colleagues as well as a few
members of the target group.


Step 7: Prepare a Draft of the Instrument and 
Gather Preliminary Pilot Data 


You are now ready to type a draft of the instrument. Work with a good
secretary to design a tentative layout and type a draft of the form. Show the
instrument to two or three appropriate teachers and colleagues for final
review of such areas as clarity of directions, readability, and ease of respond-
ing. Also, administer the instrument to a representative sample of about 10
students and watch them complete the form. Following the session, discuss
the form with them and obtain their reactions to the areas listed above.
Listen well and take good notes since a few perceptive student comments
could be of immense importance to the success of your project.


Step 8: Prepare the Final Instrument 


How professional the final instrument appears to the respondent is an im- 
portant consideration. Avoid at all costs an inferior layout and typing job 
with copies of the form run off on a ditto machine. Take the draft of the 
instrument to a professional printer and obtain advice regarding such mat- 
ters as layout, size and color of paper, and type of print to be used. If you 
cannot afford to have the instrument professionally typeset, find an experi- 
enced typist and prepare several drafts until the layout looks pleasing. Then 
take the original to a printer to be copied on colored paper or copy the form 
yourself on a good Xerox machine. If you want the respondents to take their 
job seriously, show them that the project is important by supplying them
with a well-designed, easily read form.
Step 9: Gather Final Pilot Data


Now that the instrument has been produced you are ready to gather data for
the examination of validity and reliability. Locate a representative sample of
people the size of which is such that there are 6 to 10 times as many people as
there are statements on the instrument (e.g., for a 40-item instrument use
from 240 to 400 people). You may think that this is a large sample for a
pilot study, but keep in mind that the empirical basis of the validity, reliabil-
ity, and scoring scheme for the instrument will be determined or confirmed


pect to ability, sex, and

types of schools (e.g., rural, urban, and suburban).
Finally, be aware that the factor structure of the affective characteristic
may not be the same across different age groups. If your target population is
grades 9-12, you may wish to compare the factor structure for grades 9 and
ned grades 11 and 12. A common error is to 
calculate the alpha reliability on a middle 
Toutinely use the instrument for different 
Onsideration of the 
dability of the items may reveal that the 


aracteristic at the lower grade 


Step 10: Analyze Pilot Data (Chapter 5) 


Analyses of the pilot data employ the techniques of factor analysis, item
analysis, and reliability analysis.


Factor Analysis. If you have responses from 6 to 10 times the number of 
people as items you could proceed directly to the factor analysis (see chapter 
4) to examine the response data-generated constructs that explain the varia- 
tion among the items on the instrument. These empirically derived con- 
structs are then compared with the judgmentally developed categories re- 
viewed previously during the examination of content validity in step 4. If the 
empirically derived constructs and the judgmentally created categories do 
not correspond, the conceptual and operational definitions of the affective
characteristic (see chapters 1 and 2) should be reviewed in light of the
characteristics of the target group of people.


Item Analysis. An item analysis can be conducted along with or even prior
to the factor analysis. If you have too few people in the pilot study, the item
analysis can be used to identify items to delete from the instrument prior to
running the factor analysis. The item analysis will generate response fre-
quencies, percentages, means, and standard deviations. Items associated
with either high or low means and low standard deviations should be re-
viewed and considered for deletion (see chapter 5). Also, generate correla-
tions of the items with the appropriate scale or total score for the Likert
technique and between the bipolar adjective scale and the derived concept
dimension or total score for the semantic differential. The distinction be-
tween a scale and total score for the Likert technique is based upon whether
the set of items measures more than one dimension of affect as indicated in
the factor analysis. Items should be correlated with the scale score defined
by the cluster of items defining the scale (see chapter 5).
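
The descriptive part of the item analysis can be sketched as follows. The function names, the 5-point response scale, and especially the cutoff values used to flag an item are assumptions made for illustration; the text gives no numeric cutoffs, and the responses are invented.

```python
from collections import Counter
from statistics import mean, stdev


def item_analysis(responses, scale_points=(1, 2, 3, 4, 5)):
    """Response frequencies, percentages, mean, and standard deviation for one item."""
    n = len(responses)
    counts = Counter(responses)
    return {
        "frequencies": {p: counts.get(p, 0) for p in scale_points},
        "percentages": {p: 100 * counts.get(p, 0) / n for p in scale_points},
        "mean": mean(responses),
        "sd": stdev(responses),
    }


def flag_for_review(stats, low_mean=1.5, high_mean=4.5, min_sd=0.6):
    """Flag an item with a very high or low mean and a low standard deviation.

    The three cutoffs are hypothetical and should be set by the developer.
    """
    extreme_mean = stats["mean"] <= low_mean or stats["mean"] >= high_mean
    return extreme_mean and stats["sd"] < min_sd


# Hypothetical responses from 10 people to two items.
piled_up_item = [5, 5, 5, 4, 5, 5, 5, 5, 4, 5]   # nearly everyone answers 5
spread_out_item = [1, 2, 3, 4, 5, 3, 2, 4, 3, 3]

for name, data in (("piled up", piled_up_item), ("spread out", spread_out_item)):
    stats = item_analysis(data)
    print(name, stats["frequencies"], round(stats["mean"], 2),
          round(stats["sd"], 2), flag_for_review(stats))
```

An item flagged in this way is only a candidate for review; the decision to delete should also consider the item/scale correlations discussed in chapter 5.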


Reliability. The final analysis of the pilot data consists of examining the
internal-consistency reliability of the item clusters defining each scale on the
Likert instrument or concept dimensions on the semantic differential. The
SPSS Reliability program is recommended for this analysis. For the Thur-
stone items the binary response pattern (i.e., 0 = item not selected and
1 = item selected) can also be analyzed using the alpha reliability formula. In
addition to the overall scale- or dimension-reliability coefficient, the SPSS
program will indicate the reliability of the cluster of items if each respective
item is deleted. If there are only 6-8 items per scale the deletion of an item
with a low item/scale correlation should result in a higher scale reliability for
the remaining items (see chapter 5).
Step 11: Revise the Instrument


Based especially on the information obtained from step 10, the next step is to
carry out final revisions of the instrument. Items can be added, deleted, and
revised to enhance clarity of the items and the validity and reliability of the
instrument.


Step 12: Conduct a Final Pilot Study


If substantial chan 
should be obtaine 


Step 14: Conduct Additional Validity and Reliability Analyses 


Now that you have evidence of the factor structure of the instrument as well 
as the item analysis and reliability i i 


уе » especially stability reliability, should 
also be gathered to ascertain if the measured affective characteristic is stable 
lexpectations. 
Step 15: Prepare a Test Manual


The final step in the process of instrument development is to share your work
with other professionals. A short (i.e., 10-page) manual should be written
documenting such areas as theoretical rationale, the process followed to 
develop the instrument, scoring procedures, validity, reliability, and score 
interpretation. Readers are encouraged to consult the APA publication 
entitled Standards for Educational and Psychological Tests (1985) for guide- 
lines in preparing the manual. 

In summary, developing a good instrument is a lot of hard work. If one 
cuts corners and takes the quick and easy route, the product will most likely 
reflect the level of effort. 
APPENDIX A


SEMANTIC DIFFERENTIAL: ME AS A COUNSELOR
DEVELOPED BY SALVATORE J. PAPPALARDO


INSTRUCTIONS


The purpose of this study is to measure the MEANINGS of certain things to various
people by having them judge them against a series of descriptive scales. In taking this
test, please make your judgments on the basis of what these things mean TO YOU.
On each page of this booklet you will find a different concept to be judged and
beneath it a set of scales. You are to rate the concept on each of these scales in order.


Here is how you are to use these scales:

If you feel that the concept at the top of the page is VERY CLOSELY RELATED to
one end of the scale, you should place your check-mark as follows:

fair  _X_ : ___ : ___ : ___ : ___ : ___ : ___  unfair
or
fair  ___ : ___ : ___ : ___ : ___ : ___ : _X_  unfair

If you feel that the concept is QUITE CLOSELY RELATED to one or the other end
of the scale (but not extremely), you should place your check-mark as follows:

weak  ___ : _X_ : ___ : ___ : ___ : ___ : ___  strong
or
weak  ___ : ___ : ___ : ___ : ___ : _X_ : ___  strong

If the concept seems ONLY SLIGHTLY RELATED to one side as opposed to the
other side (but is not really neutral), then you should check as follows:

active  ___ : ___ : _X_ : ___ : ___ : ___ : ___  passive
or
active  ___ : ___ : ___ : ___ : _X_ : ___ : ___  passive

The direction toward which you check, of course, depends upon which of the two
ends of the scale seems most characteristic of the thing you're judging.

If you consider the concept to be NEUTRAL on the scale, both sides of the scale
EQUALLY ASSOCIATED with the concept, or if the scale is COMPLETELY
IRRELEVANT, unrelated to the concept, then you should place your check-mark in
the middle space:

safe  ___ : ___ : ___ : _X_ : ___ : ___ : ___  dangerous

IMPORTANT: Place your check-marks IN THE MIDDLE OF SPACES, not on
the boundaries.

Be sure you check every scale for every concept. DO NOT OMIT
ANY.

Never put more than one check-mark on a single scale.

DO NOT LOOK BACK AND FORTH through the items. Do not try to remember
how you checked similar items earlier in the test. MAKE EACH ITEM A SEPA-
RATE AND INDEPENDENT JUDGMENT. It is your first impressions, the im-
mediate “feelings” about the items, that we want. On the other hand, please do not
be careless, because we want your true impressions. THANK YOU.


NAME ____________   SEX ______   DATE ____________

CONCEPT: ME AS GUIDANCE COUNSELOR

 1. CARELESS           CAREFUL
 2. CONFIDENT
 3. UNFAIR
 4. CONVENTIONAL       IMAGINATIVE
 5. SHALLOW
 6.                    PESSIMISTIC
 7. KIND
 8. VALUABLE
 9. INSENSITIVE        SENSITIVE
10. SUSPICIOUS         TRUSTING
11. ACCEPTANT          REJECTING
12. OBSTRUCTIVE        HELPFUL
13. WARM               COLD
14. UNPLEASANT         PLEASANT
15. COWARDLY           BRAVE
16. SHARP              DULL
17. TENSE
18.                    DISHONEST
19. HUMBLE             ASSERTIVE
20. ACTIVE             PASSIVE
21. WEAK               STRONG
22. GOOD
23. RELIABLE           UNRELIABLE
24. RESERVED           EASY-GOING
25. UNIMPORTANT


APPENDIX B


OCCUPATIONAL VALUES INVENTORY: NORMATIVE FORM
DEVELOPED BY ROBERT MCMORRIS


DIRECTIONS FOR PART II: For each of the following statements regarding
work values, please indicate how important you
think it is for people generally. Indicate your choice
in the columns to the right of the statement by
marking an X.


                                                     Moderately    Very
                                      Unimportant    Important     Important
People want a job where they
1. are favored by the boss
2. don’t have to think
3. receive kickbacks
4. need little training
5. can pilfer things on the job
6. are a relative of the owner
7. can expect financial favors
8. are given political favoritism
9. let others make decisions

OCCUPATIONAL VALUES INVENTORY: IPSATIVE FORM
DEVELOPED BY ROBERT MCMORRIS


DIRECTIONS FOR PART I: For each of the following sets of three statements
regarding work values, please pick the one which
you think is most important to people generally and
the one which is least important. Indicate your
choices in the columns to the right of the statements
by marking an X.


                                          Most    Least
People want a job where they
10. are favored by the boss
11. don't have to think
12. receive kickbacks

People want a job where they
13. need little training
14. can pilfer things on the job
15. are a relative of the owner

People want a job where they
16. can expect financial favors
17. are given political favoritism
18. let others make decisions


                                          Unimportant    Moderately
                                                         Important
19. To what extent is social accept-
    ability important?

Instrument Development in the Affective Domain was prepared
for the affective component of a graduate level course in affective
and cognitive instrument development. The techniques described,
and the data sets included, represent attempts to prepare mate-
rials that would illustrate proper instrument development tech-
niques in the affective domain.

Chapter 1 discusses the importance of affective variables and
presents conceptual definitions of major affective constructs.
Chapter 2 outlines and illustrates the domain-referenced approach
for developing operational definitions for the targeted conceptual
definitions. Chapter 3 addresses the important area of scaling the
affective characteristics in the context of Fishbein's expectancy-
value model. The Thurstone, latent trait, Likert, and semantic
differential techniques are included along with a section on norma-
tive versus ipsative measures. Chapters 4 and 5 present the under-
lying theory and empirical techniques appropriate for examining
validity and reliability evidence. Data gathered by the author,
using several different instruments, are included to illustrate
each technique. Decision strategies are discussed and models for
reporting the data analysis are illustrated. Chapter 6 reviews the
steps in the process of instrument development.